Network Awareness and perfSONAR
Why we want it. What are the challenges? Where are we going?
Shawn McKee / University of Michigan
OSG AHG - US CMS Tier-2 Facilities Workshop, March 23, 2015

Initial Remarks on perfSONAR
 Over the last few years WLCG sites have converged on perfSONAR as their way to measure and monitor their networks for data-intensive science.
 It was not easy to get global consensus, but we have it now: perfSONAR is mandated by all LHC VOs.
 A globally distributed infrastructure requires network measurement to:
   Understand and baseline network capacity between resource sites
   Identify and quickly fix network problems
   Inform higher-level services for decision support in network use
 A modular dashboard (MaDDash) is critical for "visibility" into networks: we can't manage, fix, or respond to problems we can't "see".
 OMD/Check_MK (used to monitor and verify the state of many globally distributed perfSONAR services) is required to keep the monitoring infrastructure itself functioning properly.
 The development of the "mesh-configuration" and its corresponding GUI was critical to creating a scalable, manageable deployment for WLCG/OSG.
 Having perfSONAR fully deployed with a global dashboard gives us powerful options for better management and use of our network.

Vision for perfSONAR-PS in WLCG/OSG
 Primary goals:
   Find and isolate "network" problems; alert in a timely way
   Characterize network use (base-lining)
   Provide a source of network metrics for higher-level services
 First step: get monitoring in place to create a baseline of the current situation between sites.
 Next: continue the measurements to track the network, alerting on problems as they develop.
 Choice of a standard "tool/framework": perfSONAR. We want to benefit from the R&E community consensus.
 perfSONAR's purpose is to aid network diagnosis by allowing users to characterize and isolate problems. It provides measurements of network performance metrics over time as well as "on-demand" tests.

Network Awareness
 Since it's part of my presentation title, we should cover this! (More on this in my Wednesday talk.)
 Networks underlie our distributed computing model but have historically been only indirectly visible. This led many to feel that most problems involving a WAN were network problems (and sometimes that was true).
 perfSONAR is part of an evolving infrastructure in which the network plays a much more visible role. With perfSONAR we can monitor our network, understand capacity, find bottlenecks and detect problems. It is NOT the only thing we need.
 With Software Defined Networking slowly creeping into our network hardware, we will have more opportunity in the future to integrate the network we need into our end-to-end systems.
 The goal is to make the network visible and controllable so we can improve our infrastructure, avoid congestion, work around failures and improve efficiency.

Challenges Deploying/Supporting perfSONAR
 I think we would all agree there are good reasons to deploy and use perfSONAR (pause for dissent…).
 But I also know some sites have had more "work" than planned in doing this.
 Our original goal was to provide something as close to "set-it-and-forget-it" as possible.
   Hard: a complex set of services, deployed in many ways on heterogeneous hardware.
   Compared to other services required for LHC, it is still not very difficult to deploy or upgrade.
   Ease of installation and management remains a primary motivator for us.
 Issues encountered [solutions]:
   Difficulty installing because of hardware specifics or the need to integrate with build systems. [
   Misconfiguration is apparently still too easy to have happen [Mesh-config, check_mk tests]
   Tests stop running or stop gathering metrics [New version coming]
   Disks filling, services taking too many resources [Reconfigure logging; defaults in 3.4.2; memory too small]

Improving perfSONAR
 We are hard at work identifying issues and working with the perfSONAR developers to get us where we want to be.
 WLCG/OSG is larger than other user communities. Because of our scale and use-cases we often identify weaknesses not easily found by smaller deployments.
 We are providing "user/admin" interfaces to make the system more valuable for everyone:
   Service/host monitoring via OMD/Check_MK (see URLs at end)
   MaDDash to expose results
   Central OSG Network Datastore to provide one-stop access to all metrics
 We have a testbed to try out new versions of perfSONAR at a suitable scale before new releases are made.
   Nebraska participates; very helpful for identifying subtle issues; being used to vet the 3.4.2 RC now.
   See MaDDash at https://maddash.aglt2.org/maddash-webui/index.cgi?dashboard=perfSONAR%20Testbed

Monitoring/Managing perfSONAR
 Our WLCG working group (Network and Transfer Metrics WG) is organizing and managing our perfSONAR deployment.
   Support contact: wlcg-perfsonar-support 'at' cern.ch
 OSG is providing our core services:
   MaDDash GUI for perfSONAR metrics
   OMD/Check_MK (Nagios system to monitor base services)
   Network datastore (using the perfSONAR esmond measurement archive)
   Mesh-creation and management system

OMD/Check_MK
We are using OMD & Check_MK to monitor our perfSONAR hosts and services. It provides a useful overview of status and problems (a sketch of such a service check follows below). [Requires an x509 certificate in your browser]
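
To show how a host/service probe can plug into this, here is a minimal sketch of a Check_MK "local check" that tests whether a perfSONAR toolkit's web interface responds and prints one status line in the local-check format (status code, service name, performance data, text). The host name psum01.example.org and the service name perfSONAR_toolkit are hypothetical; this is an illustration, not the actual check used in the WLCG OMD instance.

  #!/usr/bin/env python3
  # Sketch of a Check_MK "local check" (hypothetical; not the WLCG production check).
  # A local check prints one line per service: "<status> <name> <perfdata> <text>",
  # where status 0=OK, 1=WARN, 2=CRIT.
  import requests

  HOST = "psum01.example.org"  # hypothetical perfSONAR toolkit host

  try:
      r = requests.get("http://%s/toolkit/" % HOST, timeout=10)
      if r.status_code == 200:
          print("0 perfSONAR_toolkit - OK - toolkit web interface responding")
      else:
          print("1 perfSONAR_toolkit - WARN - HTTP status %d" % r.status_code)
  except requests.RequestException as err:
      print("2 perfSONAR_toolkit - CRIT - no response: %s" % err)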

MaDDash for Network Metric Visualization
Our perfSONAR metrics are exposed using ESnet's MaDDash (Monitoring and Debugging Dashboard), which presents a quick way to see the status of our network measurements. The colors indicate status based upon configured levels: "OK" is green, "WARNING" is yellow and "CRITICAL" is red. What we don't want to see is orange, which indicates that measurement data is not available. (A sketch of this kind of threshold logic follows below.)
On the right are the US CMS bandwidth and latency meshes as of today: http://psmad.grid.iu.edu/maddash-webui/index.cgi?dashboard=USCMS%20Mesh%20Config
For debugging examples see my talk from APAN August 2014: NE_perfSONAR_status-APAN.pdf
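
To make the color coding concrete, here is a small sketch of the kind of threshold logic a dashboard check applies to a throughput cell. The 500/200 Mbps levels are invented for illustration; they are not the thresholds configured in our MaDDash.

  #!/usr/bin/env python3
  # Illustrative MaDDash-style status classification (thresholds are invented).
  def classify_throughput(mbps):
      """Map a measured throughput (Mbps) to a dashboard status/color."""
      if mbps is None:
          return "UNKNOWN (orange): no measurement data available"
      if mbps >= 500:  # hypothetical OK threshold
          return "OK (green)"
      if mbps >= 200:  # hypothetical warning threshold
          return "WARNING (yellow)"
      return "CRITICAL (red)"

  for sample in (812.0, 310.5, 42.7, None):
      print(sample, "->", classify_throughput(sample))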

OSG Network Datastore
 A critical component is the datastore that organizes and stores the network metrics and associated metadata.
 OSG has begun gathering relevant metrics from the complete set of OSG and WLCG perfSONAR instances.
 This data will be available via an API, must be visualized, and must be organized to provide the "OSG Networking Service" (a sketch of such an API query follows below).
 Targeting a production service by mid-summer.
 Examples at end of slides.
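
Since the datastore exposes metrics through esmond's REST API (details in the additional slides), a client query might look like the sketch below. The datastore host name and the perfSONAR endpoint names are placeholders, and the field names follow esmond's documented archive format; verify against the production service once it is available.

  #!/usr/bin/env python3
  # Sketch: list recent throughput metadata between two hosts from an esmond
  # measurement archive. All host names are placeholders.
  import requests

  BASE = "http://datastore.example.org/esmond/perfsonar/archive/"  # placeholder

  params = {
      "event-type": "throughput",
      "source": "ps-bandwidth.site-a.example.org",       # hypothetical endpoint
      "destination": "ps-bandwidth.site-b.example.org",  # hypothetical endpoint
      "time-range": 86400,                               # last 24 hours
  }
  for md in requests.get(BASE, params=params, timeout=30).json():
      print(md["source"], "->", md["destination"], md["metadata-key"])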

Near Term Goals: Where Are We Going?
 We are working on a number of improvements. Perhaps the most important one should be released this week: perfSONAR 3.4.2. Everyone who installed 3.4 should "automatically" update within 24 hours (unless you turned auto-update off).
 The OSG Network Datastore, as our repository of network measurements, is targeting a production-ready date of July 2015. This is planned to be a permanent archive of all WLCG/OSG perfSONAR metrics going forward.
 Improvements in MaDDash data presentation (linking labels back to toolkit instances, more customization options, bug fixes).
 Development of PuNDIT agents that use perfSONAR measurements to identify network problems in near real-time, plus a corresponding central service to localize where those problems are coming from (a toy illustration of this kind of detection follows below).
 Longer-term improvements in perfSONAR regarding network topology and in support of problem characterization and reporting.
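
For flavor only (this is a toy, not PuNDIT's actual algorithm), an agent-side check might flag a path when most of its recent packet-loss samples stay above a small threshold:

  #!/usr/bin/env python3
  # Toy path-problem detector (NOT PuNDIT's algorithm): flag a path when a
  # sustained fraction of recent packet-loss-rate samples exceeds a threshold.
  def path_degraded(loss_samples, threshold=0.01, fraction=0.8):
      """True if at least `fraction` of samples show loss above `threshold` (1%)."""
      if not loss_samples:
          return False
      bad = sum(1 for s in loss_samples if s > threshold)
      return bad / len(loss_samples) >= fraction

  print(path_degraded([0.0, 0.02, 0.03, 0.05, 0.04]))  # True: sustained loss
  print(path_degraded([0.0, 0.0, 0.001, 0.0, 0.02]))   # False: one-off spike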

Discussions / Demos
 I wanted to reserve time for an open discussion about your experiences with perfSONAR.
   What issues do you have?
   What would you like to know more about?
   Do you know where to go to get information?
 Questions, comments, suggestions? The floor is open…

ADDITIONAL SLIDES
Further reference

Relevant URLs For More Information
 OSG/WLCG documentation "tree" at
 perfSONAR documentation at
 Troubleshooting information at testing/network-troubleshooting-quick-reference-guide/
 OMD/Check_MK (use a browser with your x509 cert!): WLCGperfSONAR%2Fcheck_mk%2Fdashboard.py
 MaDDash: https://maddash.aglt2.org/maddash-webui/index.cgi?dashboard=USCMS%20Mesh%20Config (prototype); http://psmad.grid.iu.edu/maddash-webui/index.cgi?dashboard=USCMS%20Mesh%20Config (production; relies upon datastore)
 PuNDIT project

perfSONAR Global Deployment Trend
As of February 2015 there are almost 1300 perfSONAR deployments worldwide; the OSG/LHC deployment accounts for 247 of those. The figure on the right shows the deployment count by version of perfSONAR, with some events highlighted in time via red arrows. Security issues in the underlying OS for perfSONAR helped motivate the upgrade to v3.4.
Diagram courtesy of Jason Zurawski/ESnet

OSG Datastore Details
 esmond – PostgreSQL + Cassandra
 Populated by RSV probes
 REST API available – python/perl libs
   curl "owdelay/base?time-range=86400" (a scripted version of this query is sketched below)
 Data is organized in events:
   packet-trace
   histogram-owdelay – one-way delays over a time period
   ntp-delay – round-trip delay time to the NTP server
   packet-loss-rate – number of packets lost / packets sent
   packet-count-sent – packets sent
   packet-count-lost – packets lost
   packet-retransmits – packets retransmitted for a transfer using TCP
   throughput – observed amount of data sent over a period of time
   failures – record of test failures
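
A scripted version of the curl query above might look like the following sketch. The archive base URL and the metadata key are placeholders standing in for the truncated URL, and the record layout (a timestamp plus a delay histogram per sample) follows esmond's base-data format for the histogram-owdelay event type.

  #!/usr/bin/env python3
  # Sketch: fetch 24 hours of one-way-delay histograms from an esmond archive
  # (mirrors the curl example above; host and metadata key are placeholders).
  import requests

  BASE = "http://datastore.example.org/esmond/perfsonar/archive"  # placeholder
  KEY = "abc123"  # hypothetical metadata key for one source/destination pair

  url = "%s/%s/histogram-owdelay/base" % (BASE, KEY)
  for sample in requests.get(url, params={"time-range": 86400}, timeout=30).json():
      hist = sample["val"]  # {delay_ms_bucket: packet_count, ...}
      if hist:
          delays = sorted(float(d) for d in hist)
          print(sample["ts"], "min delay %.2f ms over %d bins" % (delays[0], len(hist)))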

Datastore Structure

Data Examples: Throughput vs OWdelay vs Trace