Alerting/Notifications (MadAlert)

Presentation transcript:

Alerting/Notifications (MadAlert)
- WLCG-wide meshes
  - Latency mesh: 94 sonars (84% efficiency)
  - Traceroute mesh: 115 sonars (94% efficiency)
  - Bandwidth mesh: 102 sonars (76% efficiency)
- Stream
  - Publishing from ITB available (issues reported with VM)
  - Publishing from production planned for 13th of October
- Datastore (OSG)
  - In production since 14th of Sept
- Proximity
  - Mapping between sonars and storages available for all experiments (updated, mainly bugfixes)
- Dashboard
  - psmad connected to the datastore, already showing good content; shows more recent results than maddash.aglt2
- Alerting/Notifications (MadAlert)
  - Initial version of madalert available; shows network/infrastructure problems

perfSONAR 3.5
- Released on Monday (28th Sept)
- Support for CentOS, Debian and VMs (packaged bundles)
  - perfSONAR Tools (just tools)
  - perfSONAR TestPoint (passive, no MA)
  - perfSONAR Core (+MA)
  - perfSONAR Complete (+Web and Toolkit Configuration)
  - perfSONAR Central Management (Maddash, Auto-config, Centralized config service)
- Introduces new web frontend
- Support for low-cost nodes
- WLCG deployment status (very good progress):
  - 3.4.1: 7
  - 3.4.2: 67
  - 3.5: 154
  - Unknown: 20
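To put the deployment counts above in proportion, the following minimal sketch converts them into percentage shares. Only the four counts come from the slide; the function name `version_shares` is made up for illustration.

```python
# Host counts per perfSONAR version, as reported on the slide.
deployment = {"3.4.1": 7, "3.4.2": 67, "3.5": 154, "Unknown": 20}


def version_shares(counts):
    """Return each version's share of all hosts, in percent (one decimal)."""
    total = sum(counts.values())
    return {version: round(100 * n / total, 1) for version, n in counts.items()}


shares = version_shares(deployment)
for version, pct in shares.items():
    print(f"{version:>8}: {deployment[version]:4d} hosts ({pct}%)")
```

With these numbers, 154 of 248 hosts (about 62%) were already running 3.5.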

WLCG-wide meshes
- Summary of changes:
  - Full latency (one-direction only, 10Hz, OWAMP, IPv4)
  - Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6)
  - Full bandwidth (one-direction only, fortnightly, BWCTL-only!, IPv4, IPv6)
- Re-enabled project meshes
  - Belle II: both latency and bandwidth
  - Dual-stack: just bandwidth (both IPv4 and IPv6)
- Regional meshes still disabled, need to discuss how to evolve
  - We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using the same params)
  - We could move from regional to bigger meshes (European, Asia/Pacific, US)
  - We can create new bandwidth meshes, as BWCTL needs fewer resources (but only for BWCTL-only nodes, not on dual-nodes)
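The remark that any sub-mesh of the full latency mesh comes "for free" can be illustrated with a small sketch. It assumes a mesh can be modelled simply as the set of ordered (source, destination) host pairs; this is an illustration of the idea, not the perfSONAR mesh configuration format, and the host names are hypothetical.

```python
from itertools import permutations


def mesh_pairs(hosts):
    """All ordered (source, destination) pairs of a full mesh over `hosts`."""
    return set(permutations(hosts, 2))


# Hypothetical host names, for illustration only.
full_mesh_hosts = ["ps-a.example.org", "ps-b.example.org",
                   "ps-c.example.org", "ps-d.example.org"]
sub_mesh_hosts = ["ps-a.example.org", "ps-c.example.org", "ps-d.example.org"]

full = mesh_pairs(full_mesh_hosts)
sub = mesh_pairs(sub_mesh_hosts)

print(len(full), len(sub))  # 12 6
print(sub <= full)          # True: every sub-mesh pair is already measured
```

Because every pair of a sub-mesh is already tested in the full mesh, a sub-mesh needs no additional tests; under this model it is only a different selection of existing results, which is also why it inherits the full mesh's parameters.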

Latency full mesh issues
- Taiwan: performance, high load (long thread)
- DESY-HH: unstable (GGUS)
- Manchester: works fine, but unstable (gateway timeouts, MA unreachable)
- Durham: only works for UK (firewall?)
- MWT2, Oklahoma, SLAC: only after update to 3.5
- Florida, UCSD: performance, high load
- Wisconsin: offline (temporary)
- INFN-Roma: offline (temporary)

Remaining sonars
- Those that were not added to the full mesh and are still in global mesh (latency or bandwidth)
- Detailed summary sent to wlcg-perfsonar-support, three categories:
  - Offline or misconfigured sonars
  - Performance issues (Aachen,
  - Sonars with <4 GB RAM
- How can we integrate sonars with constrained resources?

Integration
- FTS performance study meeting held 15th of Sept.
  - TCP buffer size limit: new algorithm proposed and discussed, to be followed up by FTS team
  - Reported details on SRM overhead
- From the use case document:
  - Integration of perfSONAR in the ATLAS data analytics, Panda, SSB
  - Integration with DIRAC (LHCb)
- CERN IT Data analytics WG interested in perfSONAR
- For CMS we're currently missing a contact person (to be followed up)
- Also interest from network community (Asia Tier Centre Forum)

HEPiX
- Abstract submitted, plan is to focus on sites:
  - show importance of measuring network performance and impact of latency and packet loss on throughput
  - discuss existing network, its coverage and capabilities
  - show how one can discover nearest sonars in our network (proximity)
  - describe existing tools and show examples of how to run them from the command line to debug specific problems
  - discuss existing deployment models and options (VMs, Puppet, perfSONAR on $200 box, etc.) and new features of perfSONAR 3.5
- Based on ESnet tutorials: http://www.es.net/assets/pubs_presos/20130113-Zurawski-DMZ-Tutorial-PerfSONAR2.pdf
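One common way to illustrate the impact of latency and packet loss on throughput is the Mathis et al. approximation for steady-state TCP throughput, rate ≈ (MSS / RTT) · C / sqrt(p), with C ≈ sqrt(3/2). The slides do not say which model the talk uses, so the sketch below is an assumption; the function name and example values are illustrative only.

```python
from math import sqrt


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=sqrt(3 / 2)):
    """Approximate upper bound on single-stream TCP throughput in bit/s.

    Mathis et al. approximation: (MSS / RTT) * C / sqrt(p).
    """
    if rtt_s <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("rtt_s must be > 0 and loss_rate in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_rate))


# Illustrative values: 1460-byte MSS, 0.01% packet loss, two round-trip times.
for rtt in (0.010, 0.100):
    rate = mathis_throughput_bps(1460, rtt, 1e-4)
    print(f"RTT {rtt * 1000:.0f} ms: {rate / 1e6:.1f} Mbit/s")
```

Under this model, the same 0.01% loss that is barely noticeable on a 10 ms path caps a single stream at roughly 14 Mbit/s on a 100 ms path, which is why even small loss rates matter on long-distance links.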

GDB
- Review of the WG: focus on use cases and overall progress in different areas
- Define and understand slow transfers
  - perfSONAR commissioning and deployment: support unit, follow up, debugging (Done)
  - perfSONAR central configuration and mesh management (and ESnet project) (Done)
- Uniform way to access and integrate existing network measurements
  - Define topology in a common way: proximity service, prototype available, further testing needed
  - Common API: OSG Datastore, publishing results to MQ (Done)
- Integration
  - FTS performance study (Saul)
  - ATLAS and LHCb perfSONAR pilot projects
- Coordinated response to the network performance issues (ATLAS)
  - WLCG Network Throughput SU and underlying procedure (Done)
- Baseline existing links (full mesh), help commission new ones
  - Establishing WLCG-wide meshes (Done)
  - Running core networking meshes (LHCOPN/LHCONE) to help debug links (Done)

Next steps
- Stable infrastructure (OSG production date is 13th of Oct)
  - Production stream
  - Fix remaining issues with the OSG datastore
  - Update central dashboard (psmad), which becomes the official production dashboard
  - Update monitoring (a few metrics no longer work with 3.5)
- Discuss and agree on what meshes we introduce (mesh leaders)
- Follow up on sonars with issues
- Start focusing on the various integration efforts:
  - Resources need to come from the experiments
  - Support will be shared between OSG and WLCG WG
- Next meetings: 4th Nov, 2nd Dec