MAGGIE Measurement and Analysis for the Global Grid and Internet End-to-end performance Warren Matthews (SLAC) Presented at the Measurement SIG ESCC/Internet2 Techs Workshop. Lawrence, Kansas, August 3-7, 2003.
2 What is MAGGIE? MAGGIE –Measurement and Analysis for the Global Grid and Internet End-to-end performance A proposal for a monitoring infrastructure for network/grid management –DoE/MICS –Worldwide HENP/Grid Closely related to e2epi and PIPES
3 Toward a Monitoring Infrastructure Certainly the need –DOE Science Community –Grid –Troubleshooting –E2Epi has recognized the need. Many of the ingredients –Many monitoring projects, Many tools –PIPES
4 Network Management “Unfortunately, network management research has historically been very under-funded, because it is difficult to get funding bodies to recognize this as legitimate networking research.” Sally Floyd IAB Concerns & Recommendations Regarding Internet Research & Evolution.
5 MAGGIE NIMI Security and scheduling IEPM-BW Measurement Engine Publishing Fault Finding Analysis Engine Other tools NMWG AMP RIPE SLAC FNAL PSC ICIR LBNL SLAC ANL SCIDAC UCL
6 IEPM-BW SLAC package for monitoring and analysis. Currently 10 monitoring sites: SLAC, FNAL, GATech (SOX), INFN (Milan), NIKHEF, APAN (Japan), UMich, Internet2 (Michigan), UManchester, UCL (both UK). 2-36 targets
7 SNV SLAC CHI ESnet NY Stanford CalREN NERSC LANL JLAB TRIUMF KEK Abilene SLAC SNV FNAL ANL NIKHEF CERN IN2P3 CERN CALTECH SDSC BNL JAnet HSTN SEA ATL CLV IPLS RAL UCL UManc DL NNW NY Rice UTDallas NCSA UMich I2 SOX UFL APAN RIKEN INFN-Roma INFN-Milan CESnet APAN Geant EDG PPDG/GriPhyN Monitoring Site ORNL
8 Measurement Engine Ping, Traceroute Iperf, Bbftp, Bbcp (mem and disk) Abwe Gridftp, UDPmon Web100 Passive (netflow)
10 Analysis Engine Publishing –Usual method is on the web –Too much to view frequently, Time delay (resolve problem before user complains) Alarm System based on Web/Grid Service –GGF NMWG Schema Detect performance hits without human intervention Find location of fault as a starting point –PIPES contact database
11 Troubleshooting RIPE-TT Testbox alarm –Rolling average (morning-afternoon-evening-night) AMP Automatic Event Detection –Mean and variance Our approach is diurnal changes –Median and standard deviation of measurements on Monday 7pm-8pm
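The diurnal approach on this slide can be sketched as follows: measurements are bucketed by weekday and hour, and each bucket gets its own median and standard deviation. This is a minimal sketch, not the IEPM-BW implementation; the function name `diurnal_baselines`, the (timestamp, Mbps) tuple layout, and the sample values are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median, stdev

def diurnal_baselines(samples):
    """Group (timestamp, mbps) samples by (weekday, hour) and summarise each bucket.

    Returns {(weekday, hour): (median, std_dev)}; weekday 0 is Monday.
    Buckets with fewer than two samples get a std_dev of 0.0.
    """
    buckets = defaultdict(list)
    for ts, mbps in samples:
        buckets[(ts.weekday(), ts.hour)].append(mbps)
    return {
        key: (median(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for key, vals in buckets.items()
    }

# Hypothetical samples: three Mondays between 7pm and 8pm, one Tuesday at 3am.
samples = [
    (datetime(2003, 7, 7, 19, 5), 88.0),
    (datetime(2003, 7, 14, 19, 35), 92.0),
    (datetime(2003, 7, 21, 19, 20), 90.0),
    (datetime(2003, 7, 8, 3, 0), 95.0),
]
baselines = diurnal_baselines(samples)
print(baselines[(0, 19)])  # baseline for Monday 7pm-8pm
```

Keying on (weekday, hour) means a Monday evening measurement is only compared against earlier Monday evenings, so a regular daily or weekly load pattern is not mistaken for a fault.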
12 Case Study Iperf measurements between SLAC and Internet2 office in Ann Arbor. Determine baseline. What should be considered a performance hit? If you don’t measure, you don’t know (Kevin Walsh)
14 Busiest Routes 26,969 traceroutes 116 routes, 9 seen >1%, 3 seen >4% SLAC Stanford CalREN dnvr-sunv kscy-dnvr ipls-kscy clev-ipls mich.net Thunderbird sunvng-sunv kscyng-sunvng iplsng-kscyng mich.net Route 19 – 3447 (12.8%) Route 60 – 2865 (10.6%) Route 70 – (54.3%)
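The route percentages on this slide follow from dividing each route's count by the total number of traceroutes. The sketch below first checks the two routes whose counts are given, then shows one way such a tally might be produced. The function name `route_shares` and the toy list of route ids are hypothetical; how the original analysis assigned route numbers to traceroutes is not stated in the slides.

```python
from collections import Counter

def route_shares(route_ids):
    """Return {route_id: (count, percent_of_total)} in descending order of count."""
    counts = Counter(route_ids)
    total = sum(counts.values())
    return {
        rid: (n, round(100.0 * n / total, 1))
        for rid, n in counts.most_common()
    }

# Check the slide's percentages from its own counts.
total_traceroutes = 26969
checked = {
    route: round(100.0 * seen / total_traceroutes, 1)
    for route, seen in ((19, 3447), (60, 2865))
}
print(checked)

# Hypothetical toy input: one route id per traceroute.
shares = route_shares([70, 70, 19, 70, 60, 70, 19, 70])
print(shares)
```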
17 All Data Good Bad Better
20 Conclusion Not a network problem! Systematic error Competing periodic test/transfer Fix it! What if I don’t/can’t? Users will hit this even if I move the test
21 Mean=85.97 Mbps Median=90 Mbps Std Dev=15.1 Mbps
22 Concern Threshold (ctresh): Median – 1 std dev. Alarm Threshold (atresh): Median – 2 std dev. Concern Alarm Calm Long Term
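Using the statistics from the previous slide (median 90 Mbps, standard deviation 15.1 Mbps), the two thresholds work out to about 74.9 Mbps and 59.8 Mbps. The sketch below computes them and labels a measurement as calm, concern, or alarm. The function names and the test values are illustrative, and the handling of a value that falls exactly on a threshold is an assumption, since the slides do not specify it.

```python
def thresholds(median_mbps, std_dev_mbps):
    """Return (ctresh, atresh): median minus one and minus two standard deviations."""
    return median_mbps - std_dev_mbps, median_mbps - 2 * std_dev_mbps

def classify(mbps, ctresh, atresh):
    """Label a throughput measurement as 'calm', 'concern' or 'alarm'.

    Assumption: a value exactly on a threshold takes the milder label.
    """
    if mbps >= ctresh:
        return "calm"
    if mbps >= atresh:
        return "concern"
    return "alarm"

# Median and standard deviation taken from the statistics slide.
ctresh, atresh = thresholds(90.0, 15.1)
labels = {value: classify(value, ctresh, atresh) for value in (88.0, 70.0, 45.0)}
print(ctresh, atresh, labels)
```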
23 Calm Concern Alarm
24 To be continued … Determine statistics for all end-to-end pairs Long-term (all data), Medium term (last 30 days) and Short term (last 5 measurements) Not manually.
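The three time scales named on this slide could be computed automatically along the following lines. This is a sketch under stated assumptions: the function name `window_stats`, the (timestamp, Mbps) tuple layout, and the synthetic data are hypothetical, and the choice of median and standard deviation as the per-window statistics is carried over from the earlier slides.

```python
from datetime import datetime, timedelta
from statistics import median, stdev

def window_stats(samples, now, medium_days=30, short_count=5):
    """Summarise (timestamp, mbps) samples over three windows.

    long: all data; medium: the last `medium_days` days up to `now`;
    short: the most recent `short_count` measurements.
    Each entry is (median, std_dev), or None for fewer than two samples.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    cutoff = now - timedelta(days=medium_days)
    windows = {
        "long": [v for _, v in ordered],
        "medium": [v for t, v in ordered if t >= cutoff],
        "short": [v for _, v in ordered[-short_count:]],
    }
    return {
        name: (median(vals), stdev(vals)) if len(vals) > 1 else None
        for name, vals in windows.items()
    }

# Hypothetical data: one measurement per day for 40 days, with a dip in the last 5.
now = datetime(2003, 8, 1)
samples = [(now - timedelta(days=d), 60.0 if d < 5 else 90.0) for d in range(40)]
stats = window_stats(samples, now)
print(stats)
```

In this toy data the long-term and medium-term medians stay at the old level while the short-term median drops, which is the kind of divergence an automated alarm could act on. Running the function once per end-to-end pair would cover the "not manually" requirement.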
25 Next Steps Continue publishing Web/Grid Services –GGF/NMWG Workshop Parameters for Autoshooting NetRat (fault finding) Monalisa (visualization) –IEPM-BW is plugged in, also w/service interface Advisor Funding for MAGGIE
26 Links IEPM-BW ABwE AMP NIMI RIPE-TT SLAC Web Services GGF NMWG AMP TroubleShooting
27 Credits Les Cottrell, Connie Logg, Jerrod Williams, Jiri Navratil, Fabrizio Coccetti; Frank Nagy, Maxim Grigoriev; Brian Tierney; Eric Boyd, Jeff Boote; Vern Paxson, Andy Adams; Tony McGregor; Iosif Legrand; Local Admins and other volunteers; DoE/MICS