Network and Transfer Metrics WG Meeting Shawn McKee, Marian Babik Network and Transfer Metrics WG Meeting 18 h March 2015.

Slides:



Advertisements
Similar presentations
Update on OSG/WLCG perfSONAR infrastructure Shawn McKee, Marian Babik HEPIX Spring Workshop, Oxford 23 rd - 27 th March 2015.
Advertisements

Integrating Network and Transfer Metrics to Optimize Transfer Efficiency and Experiment Workflows Shawn McKee, Marian Babik for the WLCG Network and Transfer.
PerfSONAR in ATLAS/WLCG Shawn McKee, Marian Babik ATLAS Jamboree / Network Section 3 rd December 2014.
Network Topologies.
Proximity service Main idea – provide “glue” between experiments and sonar topology – mainly map sonars to storages and vice versa – determine existing.
LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.
INFSO-RI Enabling Grids for E-sciencE Incident Response Policies and Procedures Carlos Fuentes
Operational Security Working Group Topics Incident Handling Process –OSG Document Review & Comments:
Use Cases. Summary Define and understand slow transfers – Identify weak links, narrow down the source – Understand what perfSONAR measurements mean wrt.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE II - Network Service Level Agreement (SLA) Establishment EGEE’07 Mary Grammatikou.
Integration and Sites Rob Gardner Area Coordinators Meeting 12/4/08.
Network and Transfer WG Metrics Area Meeting Shawn McKee, Marian Babik Network and Transfer Metrics Kick-off Meeting 26 h November 2014.
News from the HEPiX IPv6 Working Group David Kelsey (STFC-RAL) WLCG GDB, CERN 8 July 2015.
News from the HEPiX IPv6 Working Group David Kelsey (STFC-RAL) GridPP35, Liverpool 11 Sep 2015.
The production deployment of IPv6 on WLCG David Kelsey (STFC-RAL) CHEP2015, OIST, Okinawa 16 Apr 2015.
Connect communicate collaborate perfSONAR MDM updates: New interface, new weathermap, towards a complete interoperability Domenico Vicinanza perfSONAR.
CHEP'07 September D0 data reprocessing on OSG Authors Andrew Baranovski (Fermilab) for B. Abbot, M. Diesburg, G. Garzoglio, T. Kurca, P. Mhashilkar.
Network and Transfer Metrics WG Meeting Shawn McKee, Marian Babik Network and Transfer Metrics WG Meeting 8 th April 2015.
New perfSonar Dashboard Andy Lake, Tom Wlodek. What is the dashboard? I assume that everybody is familiar with the “old dashboard”:
The Alternative Larry Moore. 5 Nodes and Variant Input File Sizes Hadoop Alternative.
Network and Transfer Metrics WG Meeting Shawn McKee, Marian Babik perfSONAR Operations Sub-group 22 nd October 2014.
1 Network Measurement Summary ESCC, Feb Joe Metzger ESnet Engineering Group Lawrence Berkeley National Laboratory.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.
Update on OSG/WLCG Network Services Shawn McKee, Marian Babik 2015 WLCG Collaboration Workshop 12 th April 2015.
Update on WLCG/OSG perfSONAR Infrastructure Shawn McKee, Marian Babik HEPiX Fall 2015 Meeting at BNL 13 October 2015.
NORDUnet Nordic Infrastructure for Research & Education Workshop Introduction - Finding the Match Lars Fischer LHCONE Workshop CERN, December 2012.
WLCG perfSONAR-PS Update Shawn McKee/University of Michigan WLCG Network and Transfers Metrics Co-Chair Spring 2014 HEPiX LAPP, Annecy, France May 21 st,
WLCG Network and Transfer Metrics WG After One Year Shawn McKee, Marian Babik GDB 4 th November
HEPiX IPv6 Working Group David Kelsey GDB, CERN 11 Jan 2012.
PanDA Status Report Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014.
Network and Transfer WG perfSONAR operations Shawn McKee, Marian Babik Network and Transfer Metrics WG Meeting 28 h January 2015.
PerfSONAR Update Shawn McKee/University of Michigan LHCONE/LHCOPN Meeting Cambridge, UK February 9 th, 2015.
The new FTS – proposal FTS status. EMI INFSO-RI /05/ FTS /05/ /05/ Bugs fixed – Support an SE publishing more than.
Data Transfer Service Challenge Infrastructure Ian Bird GDB 12 th January 2005.
PerfSONAR for LHCOPN/LHCONE Update Shawn McKee/University of Michigan LHCONE/LHCOPN Meeting Amsterdam, NL October 28 th, 2015.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Ops Portal New Requirements.
Network Awareness and perfSONAR Why we want it. What are the challenges? Where are we going? Shawn McKee / University of Michigan OSG AHG - US CMS Tier-2.
WLCG Latency Mesh Comments + – It can be done, works consistently and already provides useful data – Latency mesh stable, once configured sonars are stable.
FTS monitoring work WLCG service reliability workshop November 2007 Alexander Uzhinskiy Andrey Nechaevskiy.
Enabling Grids for E-sciencE INFSO-RI Enabling Grids for E-sciencE Gavin McCance GDB – 6 June 2007 FTS 2.0 deployment and testing.
LHCONE Monitoring Thoughts June 14 th, LHCOPN/LHCONE Meeting Jason Zurawski – Research Liaison.
WLCG and IPv6 David Kelsey (STFC-RAL) LHCOPN/LHCONE, Rome 28 Apr 2014.
CMS: T1 Disk/Tape separation Nicolò Magini, CERN IT/SDC Oliver Gutsche, FNAL November 11 th 2013.
David Foster, CERN GDB Meeting April 2008 GDB Meeting April 2008 LHCOPN Status and Plans A lot more detail at:
WLCG Operations Coordination report Maria Alandes, Andrea Sciabà IT-SDC On behalf of the WLCG Operations Coordination team GDB 9 th April 2014.
Using Check_MK to Monitor perfSONAR Shawn McKee/University of Michigan North American Throughput Meeting March 9 th, 2016.
News from the HEPiX IPv6 Working Group David Kelsey (STFC-RAL) HEPIX, BNL 13 Oct 2015.
The HEPiX IPv6 Working Group David Kelsey (STFC-RAL) EGI OMB 19 Dec 2013.
WLCG Operations Coordination report Maria Dimou Andrea Sciabà IT/SDC On behalf of the WLCG Operations Coordination team GDB 12 th November 2014.
Campana (CERN-IT/SDC), McKee (Michigan) 16 October 2013 Deployment of a WLCG network monitoring infrastructure based on the perfSONAR-PS technology.
Data Distribution Performance Hironori Ito Brookhaven National Laboratory.
Site notifications with SAM and Dashboards Marian Babik SDC/MI Team IT/SDC/MI 12 th June 2013 GDB.
Monitoring Working Group Update Grid Deployment Board 5 th December, CERN Ian Neilson.
Authentication and Authorisation for Research and Collaboration Taipei - Taiwan Mechanisms of Interfederation 13th March 2016 Alessandra.
HEPiX IPv6 Working Group David Kelsey (STFC-RAL) GridPP33 Ambleside 22 Aug 2014.
PerfSONAR operations meeting 3 rd October Agenda Propose changes to the current operations of perfSONAR Discuss current and future deployment model.
WLCG IPv6 deployment strategy
Shawn McKee, Marian Babik for the
James Casey, CERN IT-GD WLCG Workshop 1st September, 2007
WLCG Network Discussion
Report from WLCG Workshop 2017: WLCG Network Requirements GDB - CERN 12th of July 2017
LHCOPN/LHCONE perfSONAR Update
Networking for the Future of Science
LHCOPN/LHCONE perfSONAR Update
Update from the HEPiX IPv6 WG
Alerting/Notifications (MadAlert)
LHCONE perfSONAR: Status and Plans
Network Monitoring Update: June 14, 2017 Shawn McKee
Grid Computing 6th FCPPL Workshop
IPv6 update Duncan Rand Imperial College London
Presentation transcript:

Network and Transfer Metrics WG Meeting Shawn McKee, Marian Babik Network and Transfer Metrics WG Meeting 18 h March 2015

Next Meetings/Events – 8 Apr (CHEP), 6 May, 3 June, 8 July, 2 Sept - all at 4pm CEST – HEPiX and CHEP Input to use cases document still missing ! Today ● perfSONAR status ● Mesh configuration changes ● Operations status ● esmond status/plans ● Network performance incidents/degradations ● Integration projects ● Experiment’s interface to perfSONAR Outline Network Monitoring and Metrics WG Meeting

3 perfSONAR Status Network Monitoring and Metrics WG Meeting

Disabled all inter-cloud meshes and full mesh bandwidth Introduced full mesh latency mesh – with LHCOPN hosts for now Regional meshes unchanged Introduced Latin America mesh – lead by Renato and Pedro Introduced Dual-stack mesh – lead by Duncan – IPv6 tests for nodes that are dual stack – Plan is to progressively add nodes that support IPv6 – Test frequency can be lowered if necessary Configuration interface at – You need to be registered in OIM and authorized to access – Please contact me or Shawn if you have issues Network Monitoring and Metrics WG Meeting 4 Mesh Configuration changes

For LHCOPN/LHCONE started to investigate if sonars are consistently delivering metrics – Added new checks to psomd.grid.iu.edu to query freshness of local MAs and identify missing links – Non-critical tests that check high level functionality Results used to calculate overall completeness Detect links that are not tested – Plotting missing links to see the patterns - monitoring.cern.ch/perfsonar-debug/ monitoring.cern.ch/perfsonar-debug/ Several issues were found thanks to this – Dual-stack sonars were not tested in any way if connection to IPv6 failed – resolved by forcing all tests to be ipv4 only – Latency nodes gradually removing tests until almost nothing is tested – regular testing almost never recovers (requires reboot) – Regular testing not running on some nodes – Bandwidth nodes in better shape, but prone to misconfigurations (NDGF) – Current status information from sonars not sufficient to understand if node is healthy – limited information on status of regular testing and esmond Network Monitoring and Metrics WG Meeting 5 perfSONAR operations

But some of the issues still not understood – Very difficult to debug since everything has to be done remotely in LHCOPN/LHCONE – Attempted to get admin access to OPN perfSONARs - problematic for some sites due to policies Changed focus to 3.4.2rc – established perfSONAR testbed – Controlled environment – Participants: AGLT2, BNL, UNL, IU, MWT2 webui/index.cgi?dashboard=perfSONAR%20Testbed webui/index.cgi?dashboard=perfSONAR%20Testbed – Focus on validation and testing of release candidates (3.4.2rc) – Following up via infrastructure monitoring - includes base sonar checks, freshness checks, additional os/host level checks – Confirmed issue with OWAMP nodes and regular testing not running – Issues found were fixed, validation still ongoing Network Monitoring and Metrics WG Meeting 6 perfSONAR operations

7 esmond Status/Plans Network Monitoring and Metrics WG Meeting

Regular meetings held in OSG - Shawn coordinates the activity Planning to make it production ready by Q3 Performance was improved on VMs: configured RSV data-gathering probes to distribute starts – Original conflicted with other I/O intensive VMs – Gathering 100% of meshes (But many w/o data  ) Completeness and accuracy – Plan is to re-use the freshness tests to check accuracy and completeness of the data Network Monitoring and Metrics WG Meeting 8 esmond Status/Plans

9 Network performance incidents follow up Network Monitoring and Metrics WG Meeting

Incidents not frequent but very difficult to resolve Complex tracking is needed since many parties are involved Difficult to identify the real problem Missing process to track and follow up on incidents Case in point - recent AGLT2 – SARA incident – Detected in FTS, confirmed by FAX and perfSONAR – Investigated and narrowed down with current perfSONAR network – Ticket opened with Internet2 – followed up with a coordinated ticket in GEANT – Issue still unresolved Network Monitoring and Metrics WG Meeting 10 Network Incidents Follow up

Problems found SARA->AGLT2. FTS timed out because rate for large files < Kbytes/sec perfSONAR tests confirmed similar results Setup “temp” debug mesh using mesh-config GUI – webui/index.cgi?dashboard=Debug%20Mesh%20(temp) webui/index.cgi?dashboard=Debug%20Mesh%20(temp) Opened ticket with I2, who opened ticket with GEANT – GEANT brought up LHCONE pS instance. Tests to AGLT2 showed 3 times BW vs (much closer) SARA – Problematic link identified % packet loss Network Monitoring and Metrics WG Meeting 11 perfSONAR in Troubleshooting Impacts overall link throughput. Many fixes tried. No solution yet. Where should we track/document cases?

Process strawman TBD Experiments – Report to mailing list (wlcg-network-throughput ?) – with experiments, wlcg perfsonar support unit (incl. lhcone/lhcopn experts) – WLCG perfSONAR support unit to confirm if this is network issue - contacts sites(?) – Concerned sites report to their NREN informing mailing list – List of ongoing incidents on WG page – detailed to be provided Sites – Report to their NREN, inform mailing list – WLCG perfSONAR support unit can assist if needed – Escalate to WLCG operations coordination to resolve non- technical (policy) inter-site issues Network Monitoring and Metrics WG Meeting 12 Network Incidents Follow up

13 Integration projects Network Monitoring and Metrics WG Meeting

Meeting held 25 th of February – revise the proposal taking into account comments received during the last meeting Aim – Develop experiment’s interface to perfSONAR – Enable possibility to subscribe to different events (filter) and support different clients - integration via messaging, streaming data via topic/queue – Provide mapping/translation btw sonar infrastructure and experiment’s topology Current event-types measured by sonar infrastructure – histogram-owdelay – one way delays over time period – in total 16k events/5 mins (currently 10%) - each event contains histogram – packet-loss-rate – number of packets lost/packets sent – in total 16k links/5 mins (currently 10%) – each event contains key/value pairs (ts, loss) – packet-count-sent – packets sent – packet-count-lost – packets lost – packet-trace – in total 16k events/hour (currently 60%) - each event contains tracepath – throughput – observer amount of data sent over period of time – in total 16k events/week – each event contains key/value pair (ts, throughput) Network Monitoring and Metrics WG Meeting 14 Experiments interface to esmond

Network Monitoring and Metrics WG Meeting 15 Architecture (Current)

Network Monitoring and Metrics WG Meeting 16 Architecture (TBD)

RSV probes (OSG) – collecting metrics esmond (OSG) – datastore esmond2mq – Henryk developed a prototype in python – Retrieves all data (meta+raw) from esmond depending on existing mesh configs – Publishes to a topic – Lessons from prototype - esmond currently too slow to get raw data – requires optimization on both sides Proximity/topology service – Handle mapping/translation of services (service to service; storage to sonar), service to site (sonar to site) – Test different algorithms (site mapping, traceroutes, geoip) – Evaluate if existing tools can be reused for this purpose Clients – consume data from topic, optionally connect to proximity/topology service and re-publish extended metrics – but also connect directly to esmond if needed Network Monitoring and Metrics WG Meeting 17 Components

Network Monitoring and Metrics WG Meeting 18 AOB