IEPM-BW Deployment Experiences
Connie Logg, SLAC
Joint Techs Workshop, February 4-9, 2006

Background
IEPM-BW was originally conceived on September 13, 2001 and developed on Solaris as an exhibit for SC2001 in November 2001. It looked to be useful, so development continued. After SC2001 it was installed on a Solaris host shared with other applications, and those applications interfered with the measurements. The original configuration file was a set of Perl commands that defined the nodes, their configuration information, and the probes and parameters for each node; it was very hard to understand, maintain, modify, and manage. The system was then quickly ported to Linux and moved to its own host, but it still used Perl commands as the configuration database.

Background - continued
As development proceeded, it became obvious that the configuration information for the nodes and probes was no longer manageable. Enter the MySQL database: the whole package was redesigned around MySQL, and node specifications, monitoring host specifications, probe specifications, plot specifications, path specifications, and the measurement data are all maintained in the MySQL database. This is much more readable (web pages display the contents of the monitoring configuration), manageable, and adaptable to changing needs and specifications.

Conceptual Changes/Challenges
The conception of what IEPM-BW should do and which probes it should use has changed over time:
- Started by monitoring with ping, traceroute, and ABwE.
- Added iperf.
- Added file transfers (bbcp, bbftp, GridFTP) to the tests. These were discontinued because their performance tracked iperf, disk speed is the overriding factor in transfer throughput, monitoring and target hosts are not likely to be equipped with high-speed disks, and disk latency studies, while important, should not be part of IEPM-BW.
- Added Pathload for available bandwidth measurements, then removed it (on the suggestion that it was too intensive).
- Added pathChirp.
- Added Thrulay to compare with iperf.
- Added Pathload back in per suggestions from collaborators at the UltraLight meeting.

Currently
ABwE has been removed, as it was too noisy and did not work well on gigabit networks. We are evaluating Pathload vs. pathChirp and may remove pathChirp; more likely we will not run it to all nodes, just the ones for which it works well. We may have to use different types of probes for different types of networks and distances between nodes: ONE SIZE DOES NOT FIT ALL. All probes, presentation, and analysis are evolving as we understand more about the networking environments, which are themselves evolving.

Analysis and Presentation
Analysis and data presentation ideas change; time-series plots were the first plots we had. Of interest from the plot on this slide: pathChirp is not very good in some cases (it reports > 1 Gbps throughput); Pathload is more stable and probably more accurate; an RTT change in ping is very clear and seems to have no effect in this case, though it does in others; note that the RTT change correlated with a traceroute change.

Analysis and Presentation
Added diurnal analysis to look at how it might be useful in event detection (bandwidth changes) and possibly in prediction.
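As a rough illustration of the idea (not the production analysis), a diurnal profile can be built by binning measurements by hour of day; the input format here is an assumption:

    #!/usr/bin/perl
    # Minimal sketch: build an hourly (diurnal) throughput profile from
    # "epoch_seconds mbps" pairs read on stdin.  The input format is an
    # assumption for illustration, not the IEPM-BW data layout.
    use strict;
    use warnings;

    my (%sum, %count);
    while (my $line = <STDIN>) {
        my ($epoch, $mbps) = split ' ', $line;
        next unless defined $mbps;
        my $hour = (localtime($epoch))[2];       # hour of day, 0..23
        $sum{$hour}   += $mbps;
        $count{$hour} += 1;
    }
    for my $hour (sort { $a <=> $b } keys %sum) {
        printf "hour %02d: mean %.1f Mbps over %d probes\n",
               $hour, $sum{$hour} / $count{$hour}, $count{$hour};
    }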

Analysis and Presentation
Scatterplots are useful for looking at correlations, e.g. cross-plots with pathChirp and iperf on the Y axis vs. Thrulay on the X axis.
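The talk does not show the plotting code, but since the package drives gnuplot, a cross-plot like this could be produced roughly as follows; the data file name, column layout, and the availability of the png terminal are assumptions:

    #!/usr/bin/perl
    # Minimal sketch: drive gnuplot from Perl to cross-plot pathChirp and
    # iperf against Thrulay.  Assumes a whitespace-separated file xplot.dat
    # with columns: thrulay_mbps pathchirp_mbps iperf_mbps, and a gnuplot
    # build that includes the png terminal.
    use strict;
    use warnings;

    open(my $gp, '|-', 'gnuplot') or die "cannot start gnuplot: $!";
    print $gp <<'EOF';
    set terminal png
    set output 'xplot.png'
    set xlabel 'Thrulay (Mbps)'
    set ylabel 'pathChirp / iperf (Mbps)'
    plot 'xplot.dat' using 1:2 with points title 'pathChirp', \
         'xplot.dat' using 1:3 with points title 'iperf'
    EOF
    close($gp) or die "gnuplot exited with an error\n";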

Analysis and Presentation
Added histograms to provide the frequency distribution and CDF. These show a possible multimodal distribution of the achievable throughput measured via Thrulay, while the available bandwidth for the same node (measured by pathChirp) is stable.
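A minimal sketch of the binning behind such a histogram and CDF; the bin width and sample values are arbitrary:

    #!/usr/bin/perl
    # Minimal sketch: bin throughput values into fixed-width bins and print
    # the frequency distribution and the cumulative fraction (CDF).
    use strict;
    use warnings;

    my @mbps      = (420, 435, 450, 455, 870, 880, 890, 900, 905, 910);  # example data
    my $bin_width = 100;

    my %count;
    $count{ int($_ / $bin_width) }++ for @mbps;

    my $cum = 0;
    for my $bin (sort { $a <=> $b } keys %count) {
        $cum += $count{$bin};
        printf "%4d-%4d Mbps: %2d probes  CDF %.2f\n",
               $bin * $bin_width, ($bin + 1) * $bin_width - 1,
               $count{$bin}, $cum / scalar(@mbps);
    }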

Analysis and Presentation
Packet loss.

Traceroute Visualization
One compact page per day; one row per host, one column per hour; one character per traceroute to indicate a pathology or change (usually a period (.), meaning no change). Each unique route is identified with a number, and it is possible to inspect the route associated with a route number. This provides for analysis of long-term route evolution. The route number at the start of the day gives an idea of route stability. In the example on this slide there were multiple route changes (due to GEANT), later restored to the original route; a period (.) means no change.
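A minimal sketch of how one such daily row could be rendered, assuming the per-hour route numbers have already been determined; the host names and data are made up, and the real display uses additional characters for specific pathologies:

    #!/usr/bin/perl
    # Minimal sketch of the "one row per host, one column per hour" route
    # summary.  %routes maps host => [route numbers for hours 0..23];
    # the data below is invented for illustration.
    use strict;
    use warnings;

    my %routes = (
        'node1.example.net' => [ (101) x 24 ],
        'node2.example.net' => [ (7) x 10, (9) x 4, (7) x 10 ],
    );

    for my $host (sort keys %routes) {
        my @hourly = @{ $routes{$host} };
        my $row    = sprintf '%-20s %4d ', $host, $hourly[0];  # route # at start of day
        for my $h (1 .. $#hourly) {
            $row .= $hourly[$h] == $hourly[$h-1] ? '.'          # '.' = no change
                                                 : "|$hourly[$h]|";
        }
        print "$row\n";
    }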

Event Detection (throughput drops)
You must clearly define what you are looking for: how much change, and over what time period. You also need to decide when it is time to alert again (you do not want repeated alerts for the same drop). Use the above to figure out how often you want to probe. Do not over-probe; try to establish the necessary frequency, and if that does the job, that is enough.
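A minimal sketch of this kind of drop detection with re-alert suppression; the thresholds, windows, and notion of a baseline are illustrative assumptions, not the IEPM-BW algorithm:

    #!/usr/bin/perl
    # Minimal sketch of a throughput-drop check with re-alert suppression.
    use strict;
    use warnings;
    use List::Util qw(sum);

    my $drop_fraction   = 0.40;        # alert if the recent mean drops 40% below baseline
    my $realert_seconds = 6 * 3600;    # do not re-alert for the same drop within 6 hours

    sub mean { @_ ? sum(@_) / @_ : 0 }

    sub check_drop {
        my ($baseline_ref, $recent_ref, $last_alert_time) = @_;
        my ($base, $recent) = (mean(@$baseline_ref), mean(@$recent_ref));
        return 0 unless $base > 0;
        return 0 unless $recent < (1 - $drop_fraction) * $base;   # no significant drop
        return 0 if time() - $last_alert_time < $realert_seconds; # already alerted recently
        return 1;                                                 # raise an alert
    }

    # Example: baseline around 900 Mbps, recent hour around 450 Mbps, never alerted before.
    print check_drop([ 880, 910, 905, 895 ], [ 440, 460, 450 ], 0)
          ? "ALERT: throughput drop\n" : "ok\n";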

Implementation Challenges
Functions such as ping require different options and parsing on different OSs (see the sketch below). When upgrading versions of the probe software, the processing code may need to be modified because of output format changes; not only must the monitoring host's probe software be upgraded, but also the server versions on the target hosts. You also need to be able to track what is working and what is not, and to troubleshoot when code performance changes for the worse.
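As an illustration of the first point, a sketch of OS-dependent ping summary parsing; the Linux "rtt min/avg/max/mdev" format is standard, while the Solaris pattern shown is approximate and would need to be checked against the actual ping version in use:

    #!/usr/bin/perl
    # Minimal sketch of OS-dependent ping parsing.
    use strict;
    use warnings;

    sub parse_ping_summary {
        my ($os, $output) = @_;
        if ($os eq 'linux'
            && $output =~ m{rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms}) {
            return { min => $1, avg => $2, max => $3 };
        }
        if ($os eq 'solaris'     # approximate pattern; verify against the real output
            && $output =~ m{round-trip \(ms\)\s+min/avg/max\s*=\s*([\d.]+)/([\d.]+)/([\d.]+)}) {
            return { min => $1, avg => $2, max => $3 };
        }
        return undef;            # unknown OS, or the format changed after an upgrade
    }

    my $linux_out = "rtt min/avg/max/mdev = 30.1/30.5/31.2/0.4 ms\n";
    my $r = parse_ping_summary('linux', $linux_out);
    printf "min %.1f  avg %.1f  max %.1f ms\n", @{$r}{qw(min avg max)} if $r;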

Implementation Challenges
Which versions of gnuplot (and its drivers), MySQL, and Perl are available, and do they meet our needs? Keeping the servers alive (the target kit). Monitoring and target hosts losing disks or having their OSs upgraded. Maintaining proper TCP buffer sizes (see the sketch below).
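For the last point, a sketch of a Linux-only sanity check on the system TCP buffer limits; the 16 MB target is an arbitrary example, not a recommendation from the talk:

    #!/usr/bin/perl
    # Minimal sketch of a Linux TCP buffer sanity check.
    use strict;
    use warnings;

    my $want = 16 * 1024 * 1024;    # desired maximum socket buffer, in bytes (example)

    for my $file ('/proc/sys/net/core/rmem_max',
                  '/proc/sys/net/core/wmem_max',
                  '/proc/sys/net/ipv4/tcp_rmem',
                  '/proc/sys/net/ipv4/tcp_wmem') {
        open(my $fh, '<', $file) or do { warn "cannot read $file: $!\n"; next };
        my $line = <$fh>;
        close $fh;
        my @vals = split ' ', $line;   # tcp_rmem/tcp_wmem hold "min default max"
        my $max  = $vals[-1];
        printf "%-32s max %10d %s\n", $file, $max,
               $max < $want ? '(below target)' : 'ok';
    }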

Implementation Challenges
Many probes have to be run synchronously: do not run iperf, thrulay, and pathload at the same time. We also do not want to overload the network with probing activity, which constrains the number and frequency of probes that can be made. Currently the high-impact probes are short (20 seconds or less), and the code allows at most one such probe to run within a minute. If a process (probe, script, gnuplot, etc.) cannot hang, it will hang: time everything out and watch for hangs so they can be automatically cleaned up (see the sketch below).
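A minimal sketch of the timeout idea using Perl's alarm(); the command and timeout values are examples. Note that the probe's child process can outlive the timeout, which is one reason the cleanup pass described later is still needed:

    #!/usr/bin/perl
    # Minimal sketch of timing out a probe command.
    use strict;
    use warnings;

    sub run_with_timeout {
        my ($cmd, $timeout) = @_;
        my $output;
        eval {
            local $SIG{ALRM} = sub { die "timeout\n" };
            alarm($timeout);
            $output = `$cmd`;       # run the probe and capture its output
            alarm(0);
        };
        alarm(0);
        if ($@) {
            die $@ unless $@ eq "timeout\n";
            return undef;           # caller records the probe as hung
        }
        return $output;
    }

    my $out = run_with_timeout('ping -c 5 localhost', 30);
    print defined $out ? $out : "probe timed out\n";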

Current Implementation
MySQL tables hold all of the configuration information:
- NODES contains the node definitions and path information for each node; all nodes, both target and monitoring hosts, are defined in this table.
- MONHOST holds monitoring-host-specific information and the plotting spec for all the data.
- TOOLSPECS holds the specification for each probe, a plotting spec for its data, and a 'last run' field.
- PLOTSPECS holds miscellaneous plotting specifications (scatterplots, timeseries plots, and other plot types).

Current Implementation
MySQL tables for data storage:
- ABWEDATA, the first data table, is being discontinued.
- BWDATA holds all bandwidth data, with fields for RTT min, max, average, and standard deviation; throughput min, max, average, standard deviation, and final throughput; number of streams and window size; the text results from the probe; and the time of the probe. Not all fields are used for all data types (a rough sketch follows).
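To make the field list concrete, a rough sketch of what a BWDATA-like table could look like; the actual IEPM-BW column names and types may differ, and the database name and credentials are placeholders:

    #!/usr/bin/perl
    # Minimal sketch of a BWDATA-like table, based on the fields listed above.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=iepmbw;host=localhost',
                           'iepm', 'secret', { RaiseError => 1 });
    $dbh->do(<<'SQL');
    CREATE TABLE IF NOT EXISTS BWDATA (
      probetime   DATETIME     NOT NULL,   -- time of probe
      tool        VARCHAR(32)  NOT NULL,   -- iperf, thrulay, pathchirp, ...
      srcnode     VARCHAR(64)  NOT NULL,
      destnode    VARCHAR(64)  NOT NULL,
      rtt_min FLOAT, rtt_max FLOAT, rtt_avg FLOAT, rtt_stddev FLOAT,
      thru_min FLOAT, thru_max FLOAT, thru_avg FLOAT, thru_stddev FLOAT,
      thru_final  FLOAT,                   -- final throughput
      streams     INT,
      windowsize  INT,
      rawtext     TEXT,                    -- text results from the probe
      KEY (srcnode, destnode, tool, probetime)
    )
    SQL
    $dbh->disconnect;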

Current Implementation
Tables for traceroute data:
- ROUTENO: each route seen is given a unique identifier (routeno); each row contains the srcnode, destnode, firstseen, lastseen, and the IP hop list (a rough sketch follows).
- ROUTEDATA: the routeno (from the ROUTENO table), the text of the traceroute output, the number of hops, the IP hop list, and the time of the probe.
Historical route data may be interesting to analyze for route changes over time, but no one has had the time or interest to do it. NEW, coming soon: ASN tables to store the ASN information for the hops; this is useful because it speeds up interactive drawing, display, and analysis of the traceroutes.
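A minimal sketch of how a route number could be assigned against such a table via DBI; the column names, credentials, and the assumption that routeno is an AUTO_INCREMENT key are illustrative, not the actual schema:

    #!/usr/bin/perl
    # Minimal sketch: look a hop list up in a ROUTENO-like table and
    # insert it if it has not been seen before.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=iepmbw;host=localhost',
                           'iepm', 'secret', { RaiseError => 1 });

    sub route_number {
        my ($src, $dest, $hoplist) = @_;     # $hoplist: comma-separated hop IPs
        my ($routeno) = $dbh->selectrow_array(
            'SELECT routeno FROM ROUTENO
              WHERE srcnode=? AND destnode=? AND hoplist=?',
            undef, $src, $dest, $hoplist);
        return $routeno if defined $routeno;
        $dbh->do('INSERT INTO ROUTENO (srcnode, destnode, hoplist, firstseen, lastseen)
                  VALUES (?,?,?,NOW(),NOW())', undef, $src, $dest, $hoplist);
        return $dbh->{mysql_insertid};       # new unique route number
    }

    print route_number('mon.example.net', 'target.example.net',
                       '192.0.2.1,192.0.2.5,192.0.2.9'), "\n";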

Current Implementation
The SCHEDULE table holds the scheduling information for each probe and tracks what state it is in. Each and every probe made (including ping) has a unique schedule ID which identifies the probe and all of its parameters. The scheduler checks the TOOLSPECS table to ascertain which probes are due to be run and inserts them into the SCHEDULE table. Scheduled probes are only run if they are within the "current" time period; this prevents a large number of probes from being stacked up and flooding the network for a long time.
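A minimal sketch of the "only run if still current" rule; the field names and the way the interval is obtained are assumptions for illustration:

    #!/usr/bin/perl
    # Minimal sketch: a scheduled probe whose slot is older than one probe
    # interval is marked as missed rather than run late.
    use strict;
    use warnings;

    # One entry as it might be pulled from a SCHEDULE-like table.
    my %probe = (
        schedule_id => 12345,
        tool        => 'iperf',
        sched_time  => time() - 540,   # scheduled 9 minutes ago
        interval    => 600,            # this tool runs every 10 minutes
    );

    if (time() - $probe{sched_time} <= $probe{interval}) {
        print "running probe $probe{schedule_id} ($probe{tool})\n";
        # ... fork the probe and record state 'running' in SCHEDULE ...
    } else {
        print "skipping stale probe $probe{schedule_id}; it will appear in the lingering-tasks report\n";
        # ... record state 'missed' in SCHEDULE ...
    }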

Trouble Shooting
Every script has a log file where it records errors and performance information, such as how long it took to make a pass. These log files are rotated nightly and kept for 7 days (easily changed). Hanging probes are a fact of life: time out all probes, and create a cleanup script that looks for processes which have been active longer than they should be and kills them (see the sketch below).
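A minimal sketch of such a cleanup pass; the command names and time limits are examples, not the production IEPM-BW configuration, and it assumes a ps that supports -eo pid,etime,comm:

    #!/usr/bin/perl
    # Minimal sketch: kill probe processes running longer than a per-command limit.
    use strict;
    use warnings;

    my %max_secs = (iperf => 120, thrulay => 120, pathload => 600);

    open(my $ps, '-|', 'ps -eo pid,etime,comm') or die "ps failed: $!";
    <$ps>;                                        # skip the header line
    while (my $line = <$ps>) {
        chomp $line;
        my ($pid, $etime, $comm) = split ' ', $line, 3;
        next unless defined $comm and exists $max_secs{$comm};
        my @p    = reverse split /[-:]/, $etime;  # etime is [[dd-]hh:]mm:ss
        my $secs = ($p[0] || 0) + 60 * ($p[1] || 0)
                 + 3600 * ($p[2] || 0) + 86400 * ($p[3] || 0);
        if ($secs > $max_secs{$comm}) {
            print "killing $comm (pid $pid) after ${secs}s\n";
            kill 'TERM', $pid;
        }
    }
    close $ps;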

Troubleshooting
Lingering tasks report: a report showing scheduled probes that were not run is generated every day. This is important because if many probes are not being run in time, it may mean that too many are being scheduled or that there is a performance problem. Logging report: a report showing the number of successful probes made, database write failures, and other failure modes is generated; the information for this report is taken from the data-logging log files.

Troubleshooting
Netflow records are a valuable tool. The code had been running fine for years when "TCP orphan sockets" messages crashed the machine. Netflow records showed that flows for some 20-second iperf probes were lasting more than a minute (some up to 4 minutes), a change in behavior from the past, when they lasted on the order of seconds. We disabled the iperf probes and the system stabilized. We now need to figure out what is going on with the iperf probes; not all are troublesome, just those to a few nodes.

Performance Issues
When probes show a degradation in network performance: is it the network? Is it the monitoring node? (We had a very bad experience with a Java program; see the next slide.) Is it the target node? Recommendations: have a local target host as a sanity check (it is also good to use as a target host from other monitoring hosts); the monitoring hosts should be dedicated systems; and monitor the monitoring host's load with Lisa, Ganglia, Nagios, ApMon to MonALISA, etc.

Performance Issue Example - Bad JAVA Program
(Plots on this slide: the Caltech monitoring host as seen from iepm-; the CALTECH target host as seen from iepm-; the SLAC target host as seen from iepm-.)

Problems
Node names disappear from DNS. Ports get suddenly blocked. Disks crash (we lost the entire Caltech database; the backup was on the same physical disk), so a separate physical disk is needed for local backups. Monitoring and target hosts get OS upgrades without warning and the installed code disappears. Databases get zapped. We are now working on backing up the databases and the source code configuration information to SLAC once a day (see the sketch below). Utility packages (gnuplot, for example) get silently upgraded; there has been discussion about distributing our own.
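A minimal sketch of the nightly backup idea; the database name, paths, and destination host are placeholders, and it assumes MySQL credentials are available (e.g. in ~/.my.cnf) and ssh keys are set up for scp:

    #!/usr/bin/perl
    # Minimal sketch: dump the database nightly and copy the dump offsite.
    use strict;
    use warnings;
    use POSIX qw(strftime);

    my $date = strftime('%Y%m%d', localtime);
    my $dump = "/var/backups/iepmbw-$date.sql.gz";

    system("mysqldump --single-transaction iepmbw | gzip > $dump") == 0
        or die "mysqldump failed: $?";
    system('scp', $dump, 'backup.example.net:/backups/iepmbw/') == 0
        or die "scp failed: $?";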

Future Directions
Automate the installation and configuration process. Manage the code with CVS and distribute it via a pacman cache. Deploy IEPM-BW for LHC monitoring to see if it is useful and/or relevant; if so, it can be expanded and developed to meet changing needs. Upload monitoring data and alerts to MonALISA. Implement OWAMP and BWCTL. Look at Pathneck. Implement min and max (and maybe also average) RTT analysis and integrate it with the other change analysis.

Summary
Are you monitoring to diagnose problems or monitoring for forecasting? They are very different, but both can be done with the same monitoring. With respect to real disk-to-disk transfers, disk latency is the overwhelming factor: the monitoring can tell you how the network is performing, but that is not necessarily related to application performance. Bearing this in mind, I do not think we need to perform disk-to-disk transfers or intensive network testing with the monitoring systems. Be prepared to be flexible in your architecture: networks themselves are constantly evolving, and so the probes, analysis, and presentation must also evolve.

Finally
What would I have done differently along the way? In hindsight, not a lot; it has been a constant process of learning. The code adapted fairly well to the research we needed to do. Remember that it started as an exhibit for SC2001 and has been a research and learning tool since then. More manpower would have been very useful; had it been available, the code, package structure, and documentation would be more professional, and the change analysis and prediction/forecasting would be more complete.

References
SLAC WAN bandwidth tests: http://bw.slac.stanford.edu/slac_wan_bw_tests.html
Papers/web pages on Web100, Netflow, and active measurement correlation
Recommended monitoring and target host configurations
IEPM-BW Installation and PLM (being updated and reorganized)

Contributors: Les Cottrell, Jerrod Williams, Mahesh Chhaparia, I-Heng Mei, Manish Bhargava, Jiri Navratil, Yee Ting-Li, all at SLAC now or in the past; Maxim Grigoriev (FNAL); and the developers of the probes we use.

QUESTIONS?