1University of Adelaide Network Tomography and Internet Traffic Matrices Matthew Roughan School of Mathematical Sciences University of Adelaide
2University of Adelaide Credits vDavid Donoho – Stanford vNick Duffield – AT&T Labs-Research vAlbert Greenberg – AT&T Labs-Research vCarsten Lund – AT&T Labs-Research vQuynh Nguyen – AT&T Labs vYin Zhang – AT&T Labs-Research
3University of Adelaide Want to know demands from source to destination Problem Have link traffic measurements A B C
4University of Adelaide Example App: reliability analysis Under a link failure, routes change want to predict new link loads B A C
5University of Adelaide Network Engineering vWhat you want to do a)Reliability analysis b)Traffic engineering c)Capacity planning vWhat do you need to know üNetwork and routing üPrediction and optimization techniques ? Traffic matrix
6University of Adelaide Outline vPart I: What do we have to work with – data sources uSNMP traffic data uNetflow, packet traces uTopology, routing and configuration vPart II:Algorithms uGravity models uTomography uCombination and information theory vPart III: Applications uNetwork Reliability analysis uCapacity planning uRouting optimization (and traffic engineering in general)
7University of Adelaide Part I: Data Sources
8University of Adelaide Traffic Data
9University of Adelaide Data Availability – packet traces Packet traces limited availability – like a high zoom snap shot special equipment needed (O&M expensive even if box is cheap) lower speed interfaces (only recently OC192) huge amount of data generated
10University of Adelaide Data Availability – flow level data Flow level data not available everywhere – like a home movie of the network historically poor vendor support (from some vendors) large volume of data (1:100 compared to traffic) feature interaction/performance impact
11University of Adelaide Netflow Measurements vDetailed IP flow measurements uFlow defined by 3Source, Destination IP, 3Source, Destination Port, 3Protocol, 3Time uStatistics about flows 3Bytes, Packets, Start time, End time, etc. uEnough information to get traffic matrix vSemi-standard router feature uCisco, Juniper, etc. unot always well supported upotential performance impact on router vHuge amount of data (500GB/day)
12University of Adelaide Data Availability – SNMP SNMP traffic data – like a time lapse panorama MIB II (including IfInOctets/IfOutOctets) is available almost everywhere manageable volume of data (but poor quality) no significant impact on router performance
13University of Adelaide SNMP vPro uComparatively simple uRelatively low volume uIt is used already (lots of historical data) vCon uData quality – an issue with any data source 3Ambiguous 3Missing data 3Irregular sampling uOctets counters only tell you link utilizations 3Hard to get a traffic matrix 3Can’t tell what type of traffic 3Can’t easily detect DoS, or other unusual events uCoarse time scale (>1 minute typically; 5 min in our case)
14University of Adelaide Topology and configuration vRouter configurations uBased on downloaded router configurations, every 24 hours 3Links/interfaces 3Location (to and from) 3Function (peering, customer, backbone, …) 3OSPF weights and areas 3BGP configurations uRouting 3Forwarding tables 3BGP (table dumps and route monitor) 3OSPF table dumps vRouting simulations uSimulate IGP and BGP to get routing matrices
15University of Adelaide Part II: Algorithms
16University of Adelaide The problem Want to compute the traffic x j along route j from measurements on the links, y i router route 2 route 1 route 3
17University of Adelaide The problem y = Ax Want to compute the traffic x j along route j from measurements on the links, y i router route 2 route 1 route 3
18University of Adelaide Underconstrained linear inverse problem y = Ax Routing matrix Many more unknowns than measurements Traffic matrix Link measurements
19University of Adelaide Naive approach
20University of Adelaide Gravity Model vAssume traffic between sites is proportional to traffic at each site x 1 y 1 y 2 x 2 y 2 y 3 x 3 y 1 y 3 vAssumes there is no systematic difference between traffic in LA and NY uOnly the total volume matters uCould include a distance term, but locality of information is not as important in the Internet as in other networks
21University of Adelaide Simple gravity model
22University of Adelaide Generalized gravity model vInternet routing is asymmetric vA provider can control exit points for traffic going to peer networks peer links access links
23University of Adelaide Generalized gravity model peer links access links vInternet routing is asymmetric vA provider can control exit points for traffic going to peer networks vHave much less control over where traffic enters
24University of Adelaide Generalized gravity model
25University of Adelaide Tomographic approach y = A x router route 2 route 1 route 3
26University of Adelaide Direct Tomographic approach vUnder-constrained problem vFind additional constraints vUse a model to do so uTypical approach is to use higher order statistics of the traffic to find additional constraints vDisadvantage uComplex algorithm – doesn’t scale (~1000 nodes, routes) uReliance on higher order stats is not robust given the problems in SNMP data uModel may not be correct -> result in problems uInconsistency between model and solution
27University of Adelaide Combining gravity model and tomography tomographic constraints (from link measurements) 1. gravity solution 2. tomo-gravity solution
28University of Adelaide Regularization approach vMinimum Mutual Information: uminimize the mutual information between source and destination vNo information u The minimum is independence of source and destination 3P(S,D) = p(S) p(D) 3P(D|S) = P(D) 3actually this corresponds to the gravity model uAdd tomographic constraints: 3Including additional information as constraints 3Natural algorithm is one that minimizes the Kullback-Liebler information number of the P(S,D) with respect to P(S) P(D) Max relative entropy (relative to independence)
29University of Adelaide Validation vResults good: ±20% bounds for larger flows vObservables even better
30University of Adelaide More results tomogravity method simple approximation >80% of demands have <20% error Large errors are in small flows
31University of Adelaide Robustness (input errors)
32University of Adelaide Robustness (missing data)
33University of Adelaide Dependence on Topology clique star (20 nodes)
34University of Adelaide Additional information – Netflow
35University of Adelaide Part III: Applications
36University of Adelaide Applications vCapacity planning uOptimize network capacities to carry traffic given routing uTimescale – months vReliability Analysis uTest network has enough redundant capacity for failures uTime scale – days vTraffic engineering uOptimize routing to carry given traffic uTime scale – potentially minutes
37University of Adelaide Capacity planning vPlan network capacities uNo sophisticated queueing (yet) uOptimization problem vUsed in AT&T backbone capacity planning uFor more than well over a year uNorth American backbone vBeing extended to other networks
38University of Adelaide Network Reliability Analysis vConsider the link loads in the network under failure scenarios uTraffic will be rerouted uWhat are the new link loads? vPrototype used (> 1 year) uCurrently being turned form a prototype into a production tool for the IP backbone uAllows “what if” type questions to be asked about link failures (and span, or router failures) uAllows comprehensive analysis of network risks 3What is the link most under threat of overload under likely failure scenarios
39University of Adelaide Example use: reliability analysis
40University of Adelaide Traffic engineering and routing optimization vChoosing route parameters that use the network most efficiently uIn simple cases, load balancing across parallel routes vMethods uShortest path IGP weight optimization 3Thorup and Fortz showed could optimize OSPF weights uMulti-commodity flow optimization 3Implementation using MPLS 3Explicit route for each origin/destination pair
41University of Adelaide Comparison of route optimizations
42University of Adelaide Conclusion vProperties uFast (a few seconds for 50 nodes) uScales (to hundreds of nodes) uRobust (to errors and missing data) uAverage errors ~11%, bounds 20% for large flows vTomo-gravity implemented uAT&T’s IP backbone (AS 7018) uHourly traffic matrices for > 1 year uBeing extended to other networks
43University of Adelaide Additional slides
44University of Adelaide Validation vLook at a real network uGet SNMP from links uGet Netflow to generate a traffic matrix uCompare algorithm results with “ground truth” uProblems: 3Hard to get Netflow along whole edge of network If we had this, then we wouldn’t need SNMP approach 3Actually pretty hard to match up data Is the problem in your data: SNMP, Netflow, routing, … vSimulation uSimulate and compare uProblems 3How to generate realistic traffic matrices 3How to generate realistic network 3How to generate realistic routing 3Danger of generating exactly what you put in
45University of Adelaide Our method vWe have netflow around part of the edge (currently) vWe can generate a partial traffic matrix (hourly) uWon’t match traffic measured from SNMP on links vCan use the routing and partial traffic matrix to simulate the SNMP measurements you would get vThen solve inverse problem vAdvantage uRealistic network, routing, and traffic uComparison is direct, we know errors are due to algorithm not errors in the data
46University of Adelaide Estimates over time
47University of Adelaide Local traffic matrix (George Varghese) for reference previous case 0% 1% 5% 10%