IEPM-BW Deployment Experiences


IEPM-BW Deployment Experiences
Connie Logg, SLAC
Joint Techs Workshop, February 4-9, 2006

Background
- Originally conceived September 13, 2001, and developed as an exhibit for SC2001 in November 2001, running on Solaris.
- It looked to be useful, so development continued.
- After SC2001 it was installed on a Solaris host shared with other applications, and the other applications interfered.
- The original configuration file was a set of perl commands defining the nodes, their configuration information, and the probes and parameters for each node. It was very hard to understand, maintain, modify, and manage.
- Quickly ported to Linux and moved to its own host, but still used perl commands as the configuration database.

Background - continued
- As development proceeded, it became obvious that the configuration information for nodes and probes was no longer manageable. Enter the MySQL database.
- The whole package was redesigned around MySQL: node specifications, monitoring host specifications, probe specifications, plot specifications, path specifications, and the data itself are all maintained in the MySQL database.
- The result is much more readable (web pages display the monitoring configuration), manageable, and adaptable to changing needs and specifications.

Conceptual Changes/Challenges
- The conception of what IEPM-BW should do, and which probes it should use, has changed over time.
- Started by monitoring with ping, traceroute, and ABwE.
- Added iperf.
- Added file transfers (bbcp, bbftp, GridFTP) to the tests, then discontinued them because:
  - their performance tracked iperf;
  - disk speed is the overriding factor in throughput;
  - monitoring and target hosts are not likely to be equipped with high-speed disks;
  - disk latency studies are important, but should not be part of IEPM-BW.
- Added Pathload for available bandwidth measurements, then removed it (on the suggestion that it was too intensive) and added Pathchirp.
- Added Thrulay to compare with iperf.
- Added Pathload back in, per suggestions from collaborators at the Ultralight meeting.

Currently
- Removed ABwE, as it was too noisy and did not work well on gigabit networks.
- Evaluating Pathload vs. Pathchirp; we may remove Pathchirp, but more likely will not run it to all nodes, just the ones for which it works well.
- May have to use different types of probes for different types of networks and for different distances between nodes: ONE SIZE DOES NOT FIT ALL.
- All probes, presentation, and analysis are evolving as we understand more about the networking environments, which are themselves evolving.

Analysis and Presentation
- Analysis and data presentation ideas change; timeseries plots were the first plots we had.
- Of interest from the plot below:
  - Pathchirp is not very good in some cases; it reports > 1 Gbps throughput.
  - Pathload is more stable, and probably accurate.
  - The RTT change in ping is very clear, and seems to have no effect in this case (though it does in others); note that it correlated with a traceroute change.

Analysis and Presentation
- Added diurnal analysis to see how it might be useful in event detection (bandwidth changes) and possibly prediction.

Analysis and Presentation
- Scatterplots: useful for looking at correlations.
- Cross-plots: pathchirp and iperf on the Y axis vs. thrulay on the X axis.

Analysis and Presentation
- Added histograms to provide frequency distributions and CDFs.
- They show a possibly multimodal distribution of the achievable throughput measured by thrulay.
- But the available bandwidth for the same node (measured by pathchirp) is stable.
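The multimodality check above can be done numerically as well as visually. A minimal sketch (illustrative only, not IEPM-BW code) of building the empirical CDF from throughput samples; long flat stretches separated by big jumps hint at multiple modes:

```python
def empirical_cdf(samples):
    """Return (value, fraction of samples <= value) pairs in sorted order."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Two clusters of throughput samples (Mb/s) suggest a bimodal distribution.
cdf = empirical_cdf([300, 310, 305, 900, 910])
```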

Analysis and Presentation
- Packet loss.

Traceroute Visualization
- One compact page per day.
- One row per host, one column per hour.
- One character per traceroute to indicate a pathology or change (usually a period (.) = no change).
- Identify unique routes with a number, and be able to inspect the route associated with a route number.
- Provide for analysis of long-term route evolution.
- The route # at the start of the day gives an idea of route stability.
- Example: multiple route changes (due to GEANT), later restored to the original route.
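The one-character-per-traceroute encoding above can be sketched as follows. This is an illustrative reimplementation, not the actual IEPM-BW code, and it shows only the "route number vs. period" part of the encoding (the real tool also marks pathologies with other characters):

```python
def route_row(routes):
    """Encode one host's traceroutes for a day: emit a route number when
    the route differs from the previous probe (numbers assigned in order
    of first appearance), and a period (.) when it is unchanged."""
    ids, out, prev = {}, [], None
    for route in map(tuple, routes):
        ids.setdefault(route, len(ids) + 1)   # assign a number to new routes
        out.append("." if route == prev else str(ids[route]))
        prev = route
    return "".join(out)

r1 = ("a", "b", "c")          # hypothetical hop lists
r2 = ("a", "x", "c")
row = route_row([r1, r1, r2, r2, r1])   # route change and restoration
```

A glance at the resulting string ("1.2.1" here) shows the change away from route 1 and the later return to it.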

Event Detection (throughput drops)
- Must clearly define what you are looking for:
  - how much change, and over what time period;
  - how to determine when it is time to alert again (you do not want repeated alerts for the same drop).
- Use the above to figure out how often you need to probe. Do not overprobe: try to establish the necessary frequency, and if that does the job, that is enough.
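The two definitions above (how much change, and when to re-alert) can be made concrete in a few lines. This is a simplified sketch, not IEPM-BW's actual change-detection algorithm, and the threshold, window, and holdoff values are assumed for illustration:

```python
def detect_drops(samples, window=5, drop_fraction=0.4, holdoff=6):
    """Alert when a sample falls more than drop_fraction below the mean
    of the preceding `window` samples; suppress further alerts for
    `holdoff` samples so one drop does not alert repeatedly."""
    alerts, last = [], None
    for i in range(window, len(samples)):
        baseline = sum(samples[i - window:i]) / window
        if samples[i] < (1 - drop_fraction) * baseline:
            if last is None or i - last > holdoff:
                alerts.append(i)
                last = i
    return alerts

throughput = [100] * 10 + [40] * 10    # Mb/s samples with one sharp drop
alerts = detect_drops(throughput)      # a single alert at the drop
```

Note how the holdoff, plus the baseline adapting to the new level, yields exactly one alert for one sustained drop.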

Implementation Challenges
- Functions such as ping require different options and parsing on different OSes.
- When probe software is upgraded, the processing code may need to be modified because of output format changes. Not only must the monitoring host's probe software be upgraded, but also the server versions on the target hosts.
- Tracking what is and is not working, and troubleshooting when code performance changes for the worse.
- Automating distribution and maintenance.
- Which versions of gnuplot (and its drivers), MySQL, and perl are available, and do they meet our needs?
- Keeping the servers (the target kit) alive.
- Monitoring and target hosts losing disks or having their OSes upgraded.
- Maintaining proper TCP buffer sizes.

Implementation Challenges
- Many probes have to be run serially: do not run iperf, thrulay, and pathload at the same time.
- We do not want to overload the network with probing activity; this constrains the number and frequency of probes that can be made. Currently, high-impact probes are short (20 seconds or less), and the code allows at most one such probe to run within a minute.
- If a process (probe, script, gnuplot, etc.) can hang, it will hang. Time everything out, and watch for hangs so they can be automatically cleaned up.
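The "time everything out" rule can be sketched with a generic timeout wrapper. IEPM-BW's scripts are perl; this Python version is only an illustration of the pattern:

```python
import subprocess
import sys

def run_probe(cmd, timeout_s):
    """Run one probe command; kill it and report a hang if it exceeds
    timeout_s seconds. Returns (stdout, timed_out)."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        return done.stdout, False
    except subprocess.TimeoutExpired:
        # The child has been killed; the caller can log and reschedule.
        return "", True

# Simulate a hung probe with a child process that sleeps far too long.
_, timed_out = run_probe([sys.executable, "-c", "import time; time.sleep(30)"],
                         timeout_s=1)
```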

Current Implementation
MySQL tables for all configuration information:
- NODES contains the node definitions and path information for each node; all nodes, target and monitoring hosts alike, are defined in this table.
- MONHOST holds monitoring-host-specific information and the plotting spec for all the data.
- TOOLSPECS holds the specification for each probe, as well as a plotting spec for its data and a 'last run' field.
- PLOTSPECS holds miscellaneous plotting specifications (scatterplots, timeseries plots, other plot types).
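As a sketch of how a 'last run' field in TOOLSPECS can drive probe selection: the schema below is illustrative only (the column names are assumed, the real table has many more fields, and SQLite stands in for MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified columns for a TOOLSPECS-like table.
conn.execute("""CREATE TABLE toolspecs (
    tool       TEXT PRIMARY KEY,
    interval_s INTEGER,    -- how often the probe should run
    plot_spec  TEXT,
    last_run   INTEGER)""")
conn.executemany("INSERT INTO toolspecs VALUES (?, ?, ?, ?)",
                 [("ping", 60, "timeseries", 0),
                  ("iperf", 3600, "timeseries", 3000)])

# Probes whose interval has elapsed since their last run are due.
now = 3600
due = [t for (t,) in conn.execute(
    "SELECT tool FROM toolspecs WHERE ? - last_run >= interval_s", (now,))]
```

Keeping this in a database rather than in perl source means the configuration can be inspected and edited through web pages, as the slides describe.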

Current Implementation
MySQL tables for data storage:
- ABWEDATA, the first data table, is being discontinued.
- BWDATA holds all bandwidth data, with fields for:
  - RTT min, max, average, and standard deviation;
  - throughput min, max, average, standard deviation, and final throughput;
  - number of streams and window size;
  - the text results from the probe;
  - the time of the probe.
- Not all fields are used for all data types.

Current Implementation
Tables for traceroute data:
- ROUTENO: each route seen is given a unique identifier (routeno); the row contains srcnode, destnode, firstseen, lastseen, and the IP hop list.
- ROUTEDATA: the routeno (from the ROUTENO table), the text of the traceroute output, the number of hops, the IP hop list, and the time of the probe.
- Historical route data may be interesting to analyze for route changes over time, but no one has had the time or interest to do it.
- NEW, coming soon: ASN tables to store ASN info for the hops. This is useful because it speeds up interactive drawing, display, and analysis of the traceroutes.

Current Implementation
- The SCHEDULE table holds the scheduling information for each probe and tracks what state it is in.
- Each and every probe made (including ping) has a unique schedule ID which identifies the probe and all of its parameters.
- The scheduler checks the TOOLSPECS table to ascertain which probes are due to run, and inserts them into the SCHEDULE table.
- Scheduled probes are only run if they fall within the "current" time period. This prevents a large backlog of probes from stacking up and flooding the network for a long time.
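The "current time period" guard can be sketched as a simple staleness check. This is illustrative only; the window length is an assumed value, not IEPM-BW's actual setting:

```python
def runnable(scheduled_at, now, window_s=300):
    """A scheduled probe runs only if its slot is still 'current'.
    Stale entries are skipped rather than fired off late in a burst,
    which is what would flood the network after an outage."""
    return 0 <= now - scheduled_at <= window_s

# Hypothetical queue of (tool, scheduled_time) entries after a stall:
queue = [("iperf", 900), ("ping", 1050)]
due = [tool for tool, t in queue if runnable(t, now=1300)]  # iperf is stale
```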

Troubleshooting
- Every script has a log file where it records errors and performance information, such as how long it took to make a pass. These log files are rotated nightly and kept for 7 days (easily changed).
- Hanging processes (probes) are a fact of life:
  - time out all probes;
  - create a cleanup script that looks for processes which have been active longer than they should be, and kills them.
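The cleanup script's decision logic can be sketched as below. The per-probe lifetime limits are assumed values, and discovering the live processes (e.g. by scanning /proc or parsing ps output) is platform specific and omitted here:

```python
import time

# Assumed per-probe lifetime allowances in seconds, not IEPM-BW's values.
MAX_AGE_S = {"iperf": 60, "thrulay": 60, "gnuplot": 300}

def overdue(procs, now=None):
    """Given (pid, name, start_time) tuples for the probes we started,
    return the pids that have outlived their allowance and should be
    killed; unknown names get a generous default allowance."""
    now = time.time() if now is None else now
    return [pid for pid, name, start in procs
            if now - start > MAX_AGE_S.get(name, 600)]

hung = overdue([(4242, "iperf", 0), (4243, "gnuplot", 0)], now=100)
```

A real cleanup pass would follow with os.kill() on each returned pid and log what it killed.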

Troubleshooting
- Lingering tasks report: a report listing scheduled probes that were not run is generated every day. This is important: if many probes are not being run on time, it may mean that too many are being scheduled or that there is a performance problem.
- Logging report: a report showing the number of successful probes made, database write failures, and other failure modes. The information for this report is taken from the data-logging log files.
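A lingering tasks report is essentially one query against the SCHEDULE table. The schema and state names below are assumed for illustration, with SQLite standing in for MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified SCHEDULE table: one row per scheduled probe.
conn.execute("CREATE TABLE schedule (id INTEGER PRIMARY KEY, tool TEXT, state TEXT)")
conn.executemany("INSERT INTO schedule (tool, state) VALUES (?, ?)",
                 [("iperf", "done"), ("iperf", "skipped"),
                  ("ping", "done"), ("ping", "skipped"), ("ping", "skipped")])

# Count the probes that were scheduled but never completed, per tool.
lingering = conn.execute(
    "SELECT tool, COUNT(*) FROM schedule "
    "WHERE state != 'done' GROUP BY tool ORDER BY tool").fetchall()
```

A steadily growing count for one tool points at over-scheduling or a performance problem on that path.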

Troubleshooting
- Netflow records are a valuable tool. An example:
  - The code had been running fine for years; then "TCP orphan sockets" messages crashed the machine.
  - Netflow records showed that some 20-second iperf probes were lasting more than a minute (some as long as 4 minutes), a change from past behavior, when they lasted 20-25 seconds.
  - Disabling the iperf probes stabilized the system.
  - We now need to figure out what goes on with the iperf probes; not all are troublesome, just those to a few nodes.

Performance Issues
When probes show a degradation in network performance:
- Is it the network?
- Is it the monitoring node? (JAVA: a very bad experience)
- Is it the target node?
Recommendations:
- Have a local target host as a sanity check; it is also good to use as a target host from other monitoring hosts.
- The monitoring hosts should be dedicated systems.
- Monitor the monitoring host's load with Lisa, Ganglia, Nagios, ApMon to MonALISA, etc.

Performance Issue Example - Bad JAVA Program
- The Caltech monitoring host as seen from iepm-bw@slac
- The Caltech target host as seen from iepm-bw@slac
- The SLAC target host as seen from iepm-bw@slac

Problems
- Node names disappear from DNS.
- Ports suddenly get blocked.
- Disks crash (we lost the entire Caltech database; the backup was on the same physical disk). A separate physical disk is needed for local backups.
- Monitoring and target hosts get OS upgrades without warning, and the installed code disappears.
- Databases get zapped. We are now working on backing up the databases and source code configuration information to SLAC once a day.
- Utility packages (gnuplot, for example) get silently upgraded. There is discussion about distributing our own.

Future Directions
- Automate the installation and configuration process.
- Manage code with CVS and distribute it via a pacman cache.
- Deploy IEPM-BW for LHC monitoring; see if it is useful and/or relevant, and if so, expand and develop it to meet changing needs.
- Upload monitoring data and alerts to MonALISA.
- Implement OWAMP and BWCTL.
- Look at Pathneck.
- Implement min and max (maybe also average) RTT analysis and integrate it with the other change analysis.

Summary
- Are you monitoring to determine problems, or monitoring for forecasting? They are very different, but both can be done with the same monitoring.
- In real disk-to-disk transfers, disk latency is the overwhelming factor. The monitoring can tell you how the network is performing, but this is not necessarily related to application performance. Bearing this in mind, I do not think we need to perform disk-to-disk transfers or intensive network testing with the monitoring systems.
- Be prepared to be flexible in your architecture. Networks themselves are constantly evolving, so the probes, analysis, and presentation must evolve as well.
- What would I have done differently along the way? In hindsight, not a lot. It has been a constant process of learning, and the code adapted fairly well to the research we needed to do. Remember, it started as an exhibit for SC2001 and has been a research and learning tool ever since.
- More manpower would have been very useful. Had it been available, the code, package structure, and documentation would be more professional, and the change analysis and prediction/forecasting would be more complete.

References
- http://www-iepm.slac.stanford.edu/
- http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html
- Papers/web pages on web100, netflow, and active measurement correlation:
  - http://www.slac.stanford.edu/cgi-wrap/getdoc/slac-pub-9641.pdf
  - http://www.slac.stanford.edu/comp/net/bandwidth-tests/web100/
- Recommended monitoring and target host configurations
- IEPM-BW Installation and PLM (being updated and reorganized)

Contributors: Les Cottrell, Jerrod Williams, Mahesh Chhaparia, I-Heng Mei, Manish Bhargava, Jiri Navratil, and Yee-Ting Li, all at SLAC now or in the past; Maxim Grigoriev (FNAL); and the developers of the probes we use.

QUESTIONS?

Extra Slides

Installation Challenges - perl
- Perl is not located in the same place on all machines; we tried to settle on /usr/bin/perl.
- Needed to install various perl modules; some conflicted with already-installed modules.
- We haven't done this, but one possibility is to have a private version of perl configured for this application.

Installation Challenges - MySQL
- Is MySQL already installed, and which version? 3.x.y is not compatible with 4.x.y and 5.x.y, so we may have to install it ourselves.
- We used RPMs; some RPM releases were buggy (5.0.16 vs. 5.0.18). Install from source?
- Need to install the perl bundle for MySQL. Where to install it (/usr/local/bin/perl, /usr/bin/perl, or ?)? We could not install it under /usr/local/bin/perl at SLAC because it was an old version and lived in AFS, so we selected /usr/bin/perl.
- Installation of MySQL and the perl drivers requires attention to detail, and the order is important.