IEPM-BW (or PingER on steroids) and the PPDG
Les Cottrell – SLAC. Presented at the PPDG meeting, Toronto, Feb 2002.
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM). Supported by IUPAP. PPDG collaborator.
Overview
- Main issues being addressed by the project
- Other active measurement projects & deployment
- Deliverables from IEPM-BW
- Initial results
- Experiences
- Forecasting
- Passive measurements
- Next steps
- Scenario

People involved from SLAC include: Les Cottrell, Warren Matthews, Connie Logg, Jerrod Williams, Manish Bhargava, Jiri Navratil.
IEPM-BW: Main issues being addressed
Provide a simple, robust infrastructure for:
- Continuous/persistent and one-off measurement of network AND application performance, with flexible remote-host configuration
- Optimizing the impact of measurements: duration and frequency of active measurements, plus use of passive measurements
- Integrating a standard set of measurements, including: ping, traceroute, pipechar, iperf, bbcp …
- Allowing/encouraging the addition of new measurement and application tools

Develop tools to gather, reduce, analyze, and publicly report on the measurements: web-accessible data, tables, time series, scatterplots, histograms, forecasts …

Compare, evaluate, and validate the various measurement tools and strategies (minimize impact on others; effects of application self rate-limiting, QoS, compression …) and find better/simpler tools.

Provide simple forecasting tools to aid applications and to adapt the active measurement frequency. Provide a tool suite for high-throughput monitoring and prediction.

"Better/simpler" includes lower impact and ease of use (e.g. no need for accounts at remote sites, no need for servers), plus a better understanding of variance and realms of applicability.
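The flexible remote-host configuration idea above — a database recording which tools may run against each host, and with what options — might be sketched as follows. All host names, flags, and option strings here are hypothetical illustrations, not the project's actual configuration schema.

```python
# Hypothetical per-host configuration: each remote host carries flags saying
# which measurement tools may run there and with what options (e.g. skip
# ping where a site blocks ICMP, as the deck describes).

HOSTS = {
    "node1.example.edu": {"ping": True,  "iperf": "-w 1M -P 8",   "pipechar": False},
    "node2.example.edu": {"ping": False, "iperf": "-w 512K -P 4", "pipechar": True},
}

def commands_for(host):
    """Return the measurement command lines enabled for a given host."""
    cfg = HOSTS[host]
    cmds = []
    if cfg.get("ping"):
        cmds.append(f"ping -c 10 {host}")
    if cfg.get("iperf"):
        # 10-second iperf runs, matching the interval used in the results
        cmds.append(f"iperf -c {host} -t 10 {cfg['iperf']}")
    if cfg.get("pipechar"):
        cmds.append(f"pipechar {host}")
    return cmds

print(commands_for("node2.example.edu"))
```

Centralizing this in one table is what lets the same scheduler drive very different sites without per-site scripts.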
Other active measurement projects
Comparison of Surveyor, RIPE, AMP, PingER, NIMI, and IEPM-BW (values listed in the slide's order; the original table had merged and empty cells):
- Community: I2; Europe ISPs; NSF; HENP/ESnet/ICFA; Research; HENP/PPDG/Grid
- Coverage: mainly US; mainly Europe; 72 countries; US, CA, JP, NL, CH, UK, FR
- Metrics: one-way delay, loss; RTT, loss; RTT, loss, net thru, mcast; RTT, loss, net & app thru
- Persistent: yes*; yes; on demand
- Topology: mesh; hierarchy
- Data access: request; member; public

*Surveyor has not been taking data for many months due to a problem with the GPS clock that is being addressed.

This does not include various proposals for new projects, including some from NLANR and Internet2. It also omits the Network Weather Service (NWS), which is aimed at forecasting: it forecasts available CPU, current CPU, latency, throughput, and memory. For bandwidth, its estimates at the moment appear to be limited to between about 3 sites (UIUC, UCSD, UTK).
IEPM-BW Deployment in PPDG
The original slide tabulated which of PingER, AMP, Surveyor, RIPE, NIMI, traceroute, and IEPM-BW are deployed at each PPDG site: SLAC, LBNL, UWisc, FNAL, ANL, BNL, JLAB, Caltech, SDSC.

The deployment of these numerous projects should be leveraged. PingER is available but not sufficient for the grid. AMP is less available and also probably insufficient. There are concerns over support for Surveyor. RIPE is not widely deployed, but it is easy to get data directly. NIMI is our hope, but it is difficult to contribute to.

Other sites: CERN, IN2P3, INFN (Milan, Rome, Trieste), KEK, RIKEN, NIKHEF, DL, RAL, TRIUMF, GSFC, LANL, NERSC, ORNL, Rice, Stanford, SOX, UDelaware, UFla, UMich, UT Dallas +
IEPM-BW Deliverables
- Understand and identify the resources needed to achieve high-throughput performance for Grid and other data-intensive applications
- Provide access to archival and near real-time data and results, for both eyeballs and applications:
  - planning and expectation setting; seeing the effects of upgrades
  - assisting in troubleshooting problems by identifying what is impacted, and the time and magnitude of changes and anomalies
  - input for application steering (e.g. data-grid bulk data transfer) and changing configuration parameters
  - prediction and further analysis
- Identify critical changes in performance; record them and notify administrators and/or users
- Provide a platform for evaluating new SciDAC & base-program tools (e.g. pathrate, pathload, GridFTP, INCITE …)
- Provide a measurement/analysis/reporting suite for Grid & high-performance sites
Results so far (1/2)
- Reasonable estimates of achievable throughput with 10-second iperf measurements
- Multiple streams and big windows are critical: they improve over the defaults by a factor of 5 to 60, and there is an optimum windows*streams
- Continuous data at 90-minute intervals from SLAC to 33 hosts in 8 countries since Dec '01
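The existence of an optimum windows*streams product follows from the bandwidth-delay product: the total window across all streams needs to cover bytes-in-flight for the path. A minimal sketch of that arithmetic (the link speed and RTT below are illustrative values, not measured IEPM-BW numbers):

```python
def bdp_bytes(bottleneck_mbps, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bottleneck_mbps * 1e6 / 8 * rtt_ms / 1e3

def window_per_stream(bottleneck_mbps, rtt_ms, streams):
    """Split the required total window across parallel TCP streams."""
    return bdp_bytes(bottleneck_mbps, rtt_ms) / streams

# Illustration: a 100 Mbps path with 150 ms RTT (roughly a US-Europe path)
# needs ~1.9 MB in flight; with 8 streams each needs only a ~234 KB window.
print(round(bdp_bytes(100, 150)))             # 1875000 bytes
print(round(window_per_stream(100, 150, 8)))  # 234375 bytes per stream
```

This is why streams and window size trade off against each other: once windows*streams reaches the BDP, adding more of either stops helping.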
Results so far (2/2)
- Rule of thumb: 1 MHz of CPU ~ 1 Mbps of throughput
- bbcp memory-to-memory tracks iperf
- bbftp & bbcp disk-to-disk track iperf until disk performance limits (scatterplot: disk throughput, 0–80 Mbps, vs. iperf throughput, 0–400 Mbps)
- High throughput affects RTT for others: e.g. to Europe it adds ~100 ms; QBSS helps reduce the impact
- Archival raw throughput data & graphs are already available via http
Forecasting
- Given access to the data, one can do real-time forecasting of TCP bandwidth and file transfer/copy throughput; e.g. NWS, and "Predicting the Performance of Wide Area Data Transfers" by Vazhkudai, Schopf & Foster
- Developing a simple prototype using the average of previous measurements:
  - validate predictions against observations
  - get better estimates to adapt the frequency of active measurements & reduce their impact
  - also use ping RTTs and route information
  - look at the need for diurnal corrections
  - use for steering applications
- Working with NWS for more sophisticated forecasting
- Can also use on-demand bandwidth estimators (e.g. pipechar, but need to know their range of applicability)
Forecast results
Predict = moving average of the last 5 measurements, ± s (one standard deviation).
Example: iperf TCP throughput, SLAC to Wisconsin, Jan '02 (time series of observed vs. predicted values, 0–100 Mbits/s).
% average error = average(abs(observe − predict)/observe)

Average % error over 33 nodes:
- iperf TCP: 13% ± 11%
- bbcp mem: 23% ± 18%
- bbcp disk: 15% ± 13%
- bbftp: 14% ± 12%
- pipechar: ± 8%

Also looking at exponentially weighted moving averages.
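The simple predictor and its error metric can be sketched directly from the definitions on the slide (a minimal prototype, not the project's actual analysis code):

```python
def predict(history, n=5):
    """Forecast = mean of the last n measurements, +- their std deviation."""
    window = history[-n:]
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    return mean, var ** 0.5

def avg_pct_error(observed, predicted):
    """% average error = average(abs(observe - predict) / observe)."""
    errs = [abs(o - p) / o for o, p in zip(observed, predicted)]
    return 100 * sum(errs) / len(errs)

# Illustrative throughput series in Mbits/s (made-up numbers):
series = [42, 45, 44, 40, 43, 41]
mean, s = predict(series[:-1])       # forecast the last point from the prior 5
print(f"predict {mean:.1f} +- {s:.1f}, observed {series[-1]}")
```

An exponentially weighted moving average (the slide's next step) differs only in weighting recent samples more heavily instead of equally.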
Passive (Netflow) data
Use Netflow measurements from the border router:
- Netflow records the time, duration, bytes, packets, etc. per flow
- Calculate throughput from bytes/duration for big flows
- Validate against iperf
Experiences so far (what can go wrong, go wrong, go wrong, go wrong, go wrong, …)
- Getting ssh accounts and resources on remote hosts:
  - Tremendous variation in account procedures from site to site; takes up to 7 weeks; requires knowing somebody who cares; sites are becoming increasingly circumspect
  - Steep learning curve on ssh and its different versions
  - Getting disk space for file copies (100s of MBytes)
- Diversity of OSs, userids, directory structures, where to find perl, iperf …, and contacts:
  - Required a database to track; it also anonymizes hostnames, tracks code versions, and records whether to execute a command (e.g. no ping if the site blocks ping) & with what options
  - Developed tools to download software and to check remote configurations
- Remote server (e.g. iperf) crashes: start & kill the server remotely for each measurement
- Commands lock up or never end: time out all commands
- Some commands (e.g. pipechar) take a long time, so run them infrequently
- AFS tokens allowing access to the .ssh identity timed out: used trscron
- Protocol/port blocking (ssh following the Xmas attacks; bbftp and iperf ports; big variation between sites):
  - Wrote analyses to recognize blocking and worked with site contacts
  - An ongoing issue, especially with the increasing need for security, and since we want to measure inside firewalls close to real applications
- Built a simple tool for tracking problems
- Increasingly, sites are tying down ACLs to allow port access only between specified hosts; just keeping track of passwords is a headache
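The "time out all commands" defense against hung tools can be sketched with a standard subprocess wrapper (a minimal sketch; the 60-second default is an illustrative choice, and the real infrastructure was perl-based per the slide):

```python
import subprocess

def run_with_timeout(cmd, timeout_s=60):
    """Run a measurement command, killing it if it locks up or never ends.

    Returns (returncode, stdout); returncode is None on timeout so the
    scheduler can record the measurement as failed and move on.
    """
    try:
        done = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout_s)
        return done.returncode, done.stdout
    except subprocess.TimeoutExpired:
        return None, ""

rc, out = run_with_timeout("echo ok", timeout_s=5)
```

Long-running tools like pipechar would simply get a larger timeout and a lower scheduling frequency.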
Next steps
- Develop/extend management, analysis, reporting, and navigation tools; improve robustness and manageability; optimize measurement frequency
- Understand correlations & validate the various tools
- Tie into PingER reporting (in beta)
- Improve predictors, quantify how well they work, and provide tools to access them
- Tie in passive Netflow measurements
- Add GridFTP and new bandwidth measurers, and validate them
- Make data available via http to interested & "friendly" researchers, with:
  - CAIDA, for correlation and validation of pipechar & iperf etc. (sent documentation)
  - NWS for forecasting, with UCSB (sent documentation)
  - ANL (done)
- Make data available by standard methods (e.g. MDS, GMA)
- Make the tools portable; set up other monitoring sites, e.g. PPDG sites
- Work with NIMI/GIMI to deploy dedicated engines:
  - More uniformity, easier management, greater access granularity & authorization
  - Still need non-dedicated hosts: we want measurements from real application hosts, closer to the real end user; some apps may not be ported to the GIMI OS; not currently funded for GIMI engines
  - Use the same analysis, reporting etc.

Notes:
- Installing Globus required working around an AFS interaction problem (cannot put a personal certificate in AFS).
- Automating GridFTP measurements is hard: it requires a passphrase to be entered; it needs a Globus certificate, and few sites have a fleshed-out CA; some sites do not accept Globus certificates; it requires a server to be started (iperf and bbftp can autostart a server; bbcp is peer-to-peer); and with multiple streams it currently does not report the rate — awaiting a new release.
- GridFTP is running from SLAC to BNL and ANL; looking at IN2P3.
- "Friendly" means understanding, tolerant, willing to provide feedback, and possibly to develop tools to enhance the suite.
- N.B. Currently NWS does not do any application forecasting.
- People collaborating on this effort from outside SLAC include: Margaret Murray (CAIDA), Dantong Yu (BNL), Guojun Jin (LBNL), Rich Wolski (UCSB/NWS), Rich Baraniuk and Rolf Riedi (Rice).
Scenario
A BaBar user wants to transfer a large volume (e.g. a TByte) of data from SLAC to IN2P3:
1. Select initial windows and streams from a table of pre-measured optimal values, or use an on-demand tool (extended iperf), or a reasonable default if none is available.
2. The application uses the data volume to be transferred and a simple forecast to estimate how much time is needed. Forecasts come from the active-measurement archive, from Netflow, or from on-demand one-end bandwidth estimation tools (e.g. pipechar, the NWS TCP throughput estimator).
3. If the estimated duration is longer than some threshold, a more careful estimate is made using diurnal forecasting.
4. The application reports to the user, who decides whether to proceed.
5. The application turns on QBSS and starts transferring.
6. For long transfers, provide progress feedback using progress so far, Netflow measurements of this flow over the last few half-hours, diurnal corrections, etc.
7. If falling behind the required duration, turn off QBSS and go to best effort; if throughput drops below some threshold, check for other sites.

We can't do iperf to a random site, since we do not expect an iperf server to be there, and we are not confident that current single-site bandwidth measurers are accurate enough for FTP-type predictions.
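The planning steps above can be sketched as a small decision function (all thresholds, parameter names, and the forecast value here are hypothetical illustrations of the scenario, not actual BaBar/IN2P3 policy):

```python
def plan_transfer(volume_gb, forecast_mbps, deadline_hours,
                  careful_threshold_hours=6):
    """Estimate transfer time from a simple forecast and decide how to run.

    Mirrors the scenario: long transfers trigger a more careful (diurnal)
    forecast, and QBSS scavenger service is used only while there is slack
    against the deadline.
    """
    est_hours = volume_gb * 8e3 / forecast_mbps / 3600  # GB -> Mbit, / rate
    return {
        "est_hours": est_hours,
        "needs_diurnal_forecast": est_hours > careful_threshold_hours,
        "use_qbss": est_hours <= deadline_hours,
    }

# Illustration: 1 TB at a forecast 200 Mbit/s takes ~11 hours, so a careful
# diurnal forecast is warranted, but a 24-hour deadline still leaves slack.
print(plan_transfer(1000, 200, deadline_hours=24))
```

During the transfer, the same arithmetic re-run on progress-so-far data drives the "fall back to best effort" decision in step 7.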
More Information
- IEPM/PingER home site: www-iepm.slac.stanford.edu/
- IEPM-BW site: www-iepm.slac.stanford.edu/bw
- Bulk throughput site: www-iepm.slac.stanford.edu/monitoring/bulk/
- SC2001 & high-throughput measurements: www-iepm.slac.stanford.edu/monitoring/bulk/sc2001/
- QBSS measurements: www-iepm.slac.stanford.edu/monitoring/qbss/measure.html
- Netflow