Automatically Inferring Patterns of Resource Consumption in Network Traffic
Cristian Estan, Stefan Savage, George Varghese
University of California, San Diego
Who is using my link?

What problem does our paper focus on? We want to help network administrators and operators answer a question they ask pretty often: who is using my link? We have an important resource, say the link between a campus and the rest of the world, and we want to know what kind of traffic mix it carries.

November 13, 2018 Traffic Clusters - 2003
Looking at the traffic

Too much data for a human. Do something smarter!

Our imaginary network administrator tried looking at the traffic directly, but he soon realized that this is not the way to go. So he got some software for summarizing the traffic instead.
Looking at traffic aggregates

Single-field reports: Src. IP, Src. net, Src. port, Dest. IP, Dest. net, Dest. port, Protocol

Destination IP report (callout: "Most traffic goes to the dorms …")
Rank  Destination IP         Traffic
1     jeff.dorm.bigU.edu     11.9%
2     tracy.dorm.bigU.edu    3.12%
3     risc.cs.bigU.edu       2.83%

Destination network report
Rank  Destination network    Traffic
1     library.bigU.edu       27.5%
2     cs.bigU.edu            18.1%
3     dorm.bigU.edu          17.8%

Source port report (callout: "What apps are used?")
Rank  Source port            Traffic
1     Web                    42.1%
2     Kazaa                  6.7%
3     Ssh                    6.3%

Other callouts: "Where does the traffic come from? ……" and "Which network uses web and which one kazaa?"

Aggregating on individual packet header fields gives useful results but:
- Traffic reports are not always at the right granularity (e.g. individual IP address, subnet, etc.)
- They cannot show aggregates defined over multiple fields (e.g. which network uses which application)
The traffic analysis tool should automatically find aggregates over the right fields at the right granularity.

The current solution used by network administrators is to look at aggregates. For example, the first table shows the IP addresses that receive the most traffic. This is useful information, but we should be careful not to read too much into it. Our network administrator concludes that most traffic goes to the dorms. If we aggregate along the destination IP field at the granularity of networks rather than individual IP addresses, we see he's not right, because both the library and the CS department receive more traffic than the dorms.

But we have other questions too. To figure out what applications are used we aggregate by source port. We also want to know where the traffic comes from, and by the time we answer all the questions we have one table or more for each field in the packet header. But does this really answer all the questions? None of these tables tells us which network uses web and which one kazaa.

So, while aggregating on individual header fields gives us useful results, it has its problems. As we saw with the destination IP report, the aggregation is not always at the right granularity. The fundamental limitation of these reports, however, is that they cannot show aggregates over multiple fields. Based on these single field reports we cannot answer questions such as which network uses which application. What our administrator would really like is for the traffic analysis software to automatically find aggregates over the right fields at the right granularity.
Ideal traffic report

Traffic aggregate                                             Traffic
Web traffic                                                   42.1%
Web traffic to library.bigU.edu                               26.7%
Web traffic from www.schwarzenegger.com                       13.4%
ICMP traffic from sloppynet.badU.edu to jeff.dorm.bigU.edu    11.9%

Callouts, one per row: "Web is the dominant application", "The library is a heavy user of web", "That’s a big flash crowd!", "This is a Denial of Service attack !!"

This paper is about giving the network administrator insightful traffic reports.

A traffic report to make him happy could look something like this. The first line clearly identifies web as the dominant application. The second one points out that the library is a heavy user of web. From the third line our network administrator concludes that, for some reason or another, many people are looking at Schwarzenegger's web site now. The fourth leads to the conclusion that there is an ICMP denial of service attack against a victim in the dorms.

Seeing the gap between the types of traffic reports network administrators have and what they would like, we decided to try to help them. This paper describes the method we propose for turning raw traffic data into reports that readily reveal the important characteristics of the traffic mix.
Contributions of this paper
- Approach
- Definitions
- Algorithms
- System
- Experience

I’m going to structure the rest of the talk around the contributions of this paper. We have five important ones. The first one is articulating a novel approach to describing traffic mixes. The next one is coming up with concrete definitions that materialize this approach. We also have algorithms to compute traffic reports conforming to our definitions. We have a prototype system that embodies these ideas. We confirmed its usefulness by running it on traces from various networks and I am going to present some of our findings.
Approach

Characterize the traffic mix by describing all important traffic aggregates
- Multidimensional aggregates (e.g. flash crowd described by protocol, port number and IP address)
- Aggregates at the right level of granularity (e.g. computer, subnet, ISP)
Traffic analysis is automated: it finds insightful data without human guidance

We saw earlier that the important traffic patterns are varied and dynamic (we have flash crowds, denial of service attacks, worms) and we want to know about them all when they happen on our network. So our description of the traffic mix will identify all important traffic aggregates. We must be able to describe multidimensional aggregates: aggregates that are defined through more than one field in the packet header. We also want our description of the traffic mix to be at the right granularity. And of course we want the process to be automatic. The less involvement it requires from the busy network administrator, the better.
Definition: traffic clusters

Traffic clusters are the multidimensional traffic aggregates identified by our reports
- A cluster is defined by a range for each field
- The ranges are from natural hierarchies (e.g. IP prefix hierarchy), which gives meaningful aggregates
Example
- Traffic aggregate: incoming web traffic for CS Dept.
- Traffic cluster: ( SrcIP=*, DestIP in 132.239.64.0/21, Proto=TCP, SrcPort=80, DestPort in [1024,65535] )

Traffic aggregate is a fuzzy term. In this paper we use traffic clusters as a formal term for traffic aggregates. A traffic cluster roughly matches the ACL rule you could use in a router or firewall to filter a given aggregate, except that the ACL rules are specified by the network administrator whereas traffic clusters are found through an analysis of the traffic.

A cluster is defined through a range for each one of the five header fields we consider. To obtain meaningful traffic clusters, we restrict the ranges used to natural hierarchies associated with each field. For example the source IP address can be defined by any valid IP prefix with length from 32 to 8 or 0. For port numbers we can have the natural ranges of low ports below 1024 and high ports from 1024 up.

Let’s look at an example. The aggregate we want to say something about is the incoming web traffic for the computer science department. The cluster has a “don’t care” range for the source IP address because packets that are part of this aggregate can come from any IP address. It has the prefix used by the computer science network as destination IP range because all packets that are part of this aggregate have destination IP addresses in this prefix. The protocol must be TCP, the source port 80, and the destination port in the range 1024 to 65535 because web browsers use ports in this range.

One important thing to note here is that all hierarchies contain the “don’t care” pattern. So a cluster can ignore any of the fields by simply using * for the corresponding range (like we did with the source address in this example).
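The cluster definition above can be sketched in a few lines of Python. This is only an illustration of "one range per header field": the `Cluster` class, its field names, and the sample packets are assumptions made for this sketch, not AutoFocus code. The example cluster is the one from the slide.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical "don't care" values: the root of each field's hierarchy.
ANY_NET = ipaddress.ip_network("0.0.0.0/0")
ANY_PORT = (0, 65535)


@dataclass(frozen=True)
class Cluster:
    """One range per header field; defaults mean "don't care" (*)."""
    src_net: ipaddress.IPv4Network = ANY_NET
    dst_net: ipaddress.IPv4Network = ANY_NET
    proto: object = None            # None stands for "*"
    src_ports: tuple = ANY_PORT     # inclusive (low, high)
    dst_ports: tuple = ANY_PORT

    def matches(self, src_ip, dst_ip, proto, src_port, dst_port):
        """Return True if a packet with these header fields is in the cluster."""
        return (ipaddress.ip_address(src_ip) in self.src_net
                and ipaddress.ip_address(dst_ip) in self.dst_net
                and (self.proto is None or self.proto == proto)
                and self.src_ports[0] <= src_port <= self.src_ports[1]
                and self.dst_ports[0] <= dst_port <= self.dst_ports[1])


# Incoming web traffic for the CS Dept., as on the slide.
cs_web_in = Cluster(
    dst_net=ipaddress.ip_network("132.239.64.0/21"),
    proto="TCP",
    src_ports=(80, 80),
    dst_ports=(1024, 65535),
)

# Made-up packets: one inside the CS prefix, one just outside it.
inside = cs_web_in.matches("8.8.8.8", "132.239.65.10", "TCP", 80, 40000)
outside = cs_web_in.matches("8.8.8.8", "132.239.72.1", "TCP", 80, 40000)
```

Leaving a field at its default plays the role of the * pattern, which is why the source address does not appear in the example cluster.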
Definition: traffic report

Traffic reports give the volume of chosen traffic clusters
- To keep report size manageable, describe only clusters above a threshold (e.g. H = total traffic/20)
- To avoid redundant data, compress by omitting clusters whose traffic can be inferred (up to error H) from non-overlapping more specific clusters in the report
- To highlight non-obvious aggregates, prioritize by using an unexpectedness label
Example
- 50% of all traffic is web
- Prefix B receives 20% of all traffic
- The web traffic received by prefix B is 15% instead of 50%*20%=10%, so the unexpectedness label is 15%/10%=150%

A traffic report describes a traffic mix by listing chosen aggregates from within the mix and giving their traffic. We can measure the traffic of an aggregate in bytes or packets. But which aggregates should we include in the description of the mix? To keep the size of the report manageable we only include clusters whose traffic is above a configurable threshold. This exploits the observation that usually the large aggregates are the important ones.

But even with thresholding the report can be too large because it often includes redundant information. For example, if an IP address has traffic above the threshold, the prefix of length 31 that includes it will be above the threshold too, and so will that of length 30, and all of them will be put into the report even if they contain no traffic other than that of the high volume IP address. Having multiple dimensions only makes this problem worse. We can reduce the size of the report considerably by omitting these redundant clusters. Our compression rule is that if the traffic of a large cluster X is within a certain absolute error of the sum of the traffic of more specific clusters that don’t overlap and are in the report, we can omit cluster X because the reader can infer its traffic based on what is in the report already. We use the same value as the threshold H for the acceptable absolute error.

We can improve the readability of the report by prioritizing the clusters in it. We do this using unexpectedness labels that are best explained through the example on the slide. I’d like to point out that even though thresholds have been used before in data mining when solving the related problem of finding association rules, compression and the unexpectedness labels are novel.
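The unexpectedness label from the slide example can be written out as a small calculation. This is a minimal sketch that assumes the expected share of a cluster is simply the product of its single-field shares, as in the 50% * 20% = 10% example; the function name `unexpectedness` is made up for illustration.

```python
def unexpectedness(cluster_share, field_shares):
    """Ratio of the observed share of a cluster to the share expected if
    the fields were independent (product of the single-field shares)."""
    expected = 1.0
    for share in field_shares:
        expected *= share
    return cluster_share / expected


# Slide example: web is 50% of all traffic, prefix B receives 20%,
# and web traffic to prefix B is 15% of all traffic.
label = unexpectedness(0.15, [0.50, 0.20])
print(f"unexpectedness label: {label:.0%}")
```

A label well above 100% marks a cluster that carries more traffic than its individual fields would suggest, while a label well below 100% marks one that carries less.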
Contributions of this paper: Approach, Definitions, Algorithms, System, Experience

Let’s go on now to the third contribution of this paper, the algorithms we have for producing traffic reports.
Algorithms and theory

Algorithms and theoretical bounds in the paper
- Unidimensional reports are easy to compute
- Multidimensional reports are exponentially harder as we add more fields
Next few slides
- Example of unidimensional compression
- Example for the structure of the multidimensional cluster space

Our algorithms take raw traffic data such as packet traces or NetFlow records, compute the traffic of various traffic clusters present in the traffic and apply thresholding, compression and unexpectedness labels to produce the lists of clusters that make up various types of reports. We also have bounds on the sizes of these reports. Unidimensional reports are easy to compute, but multidimensional reports are exponentially harder as we add more fields. Instead of talking about these algorithms, I will present two examples in the next few slides: an example of unidimensional compression and one for the structure of the multidimensional cluster space our algorithms operate in.
Unidimensional report example: hierarchy and thresholding

Threshold = 100

Packets per source IP address (leaves of the hierarchy):
10.0.0.2: 15, 10.0.0.3: 35, 10.0.0.4: 30, 10.0.0.5: 40, 10.0.0.8: 160, 10.0.0.9: 110, 10.0.0.10: 35, 10.0.0.14: 75

Traffic per prefix:
/31: 10.0.0.2/31 = 50, 10.0.0.4/31 = 70, 10.0.0.8/31 = 270, 10.0.0.10/31 = 35, 10.0.0.14/31 = 75
/30: 10.0.0.0/30 = 50, 10.0.0.4/30 = 70, 10.0.0.8/30 = 305, 10.0.0.12/30 = 75
/29: 10.0.0.0/29 = 120, 10.0.0.8/29 = 380
/28: 10.0.0.0/28 = 500

Clusters above the threshold: 10.0.0.8 (160), 10.0.0.9 (110), 10.0.0.8/31 (270), 10.0.0.8/30 (305), 10.0.0.0/29 (120), 10.0.0.8/29 (380), 10.0.0.0/28 (500)

First we'll look at how we can compute an uncompressed unidimensional traffic report on the source IP of the packets. Using a hash table we already collected the number of packets sent by each active IP address, and those are the numbers within the circles at the bottom of the page. While these numbers do describe the traffic mix accurately, using them as a traffic report does not provide that much insight. From here it is easy to compute the traffic of all the prefixes in the IP hierarchy. After we apply the threshold we keep only the prefixes whose traffic is above 100. We can see how this ensures that we are looking at traffic at the right granularity. This IP address, which has large traffic, appears on its own in the report whereas this other one with small traffic doesn't. However, a more general prefix that has enough traffic does get included.
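The per-prefix totals and the thresholding step can be reproduced with a short Python sketch. It uses the packet counts from this slide; the function names and the choice to keep clusters with traffic greater than or equal to the threshold are assumptions made here, and the code illustrates the idea rather than the algorithm from the paper.

```python
import ipaddress


def prefix_totals(host_packets, root_len):
    """Add each host's packet count to every prefix that contains it,
    from /32 down to the root prefix length."""
    totals = {}
    for host, packets in host_packets.items():
        addr = int(ipaddress.IPv4Address(host))
        for plen in range(32, root_len - 1, -1):
            shift = 32 - plen
            base = (addr >> shift) << shift          # clear the host bits
            key = f"{ipaddress.IPv4Address(base)}/{plen}"
            totals[key] = totals.get(key, 0) + packets
    return totals


def uncompressed_report(totals, threshold):
    """Keep only the prefixes whose traffic reaches the threshold."""
    return {prefix: t for prefix, t in totals.items() if t >= threshold}


# Packet counts per source IP address, taken from the slide.
hosts = {
    "10.0.0.2": 15, "10.0.0.3": 35, "10.0.0.4": 30, "10.0.0.5": 40,
    "10.0.0.8": 160, "10.0.0.9": 110, "10.0.0.10": 35, "10.0.0.14": 75,
}
totals = prefix_totals(hosts, root_len=28)
report = uncompressed_report(totals, threshold=100)
for prefix, traffic in report.items():
    print(prefix, traffic)
```

With these inputs the report holds seven clusters, from the two busy /32 addresses up to 10.0.0.0/28, which is the tree that compression then works on.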
Unidimensional report example: compression

Clusters above the threshold (100): 10.0.0.8 (160), 10.0.0.9 (110), 10.0.0.8/31 (270), 10.0.0.8/30 (305), 10.0.0.0/29 (120), 10.0.0.8/29 (380), 10.0.0.0/28 (500)

Compression checks shown on the slide: 305-270<100 (10.0.0.8/30 is omitted), 380-270≥100 (10.0.0.8/29 is kept)

Compressed report:
Source IP      Traffic
10.0.0.0/29    120
10.0.0.8/29    380
10.0.0.8       160
10.0.0.9       110

Once we have this tree of clusters with traffic over 100, we can apply compression. The nodes that are leaves of this tree (10.0.0.8, 10.0.0.9 and 10.0.0.0/29) obviously have to be in the compressed report because there are no more specific large clusters that could describe their traffic. This node (10.0.0.8/31) has no reason to be in the report because its traffic matches exactly the traffic of its two children that are in the report already. This one (10.0.0.8/30) has some more traffic, but the difference is not large enough to warrant its inclusion. This node (10.0.0.8/29) has traffic that exceeds the traffic of the two more specific ones by more than our threshold, 100, so we do include it in the compressed report. We put these clusters in tabular form and that’s our traffic report.

Note that the compressed traffic report, for both the unidimensional and the multidimensional case, is guaranteed to describe all aggregates with traffic above the threshold. Furthermore, it also gives us the traffic of each of these aggregates within an absolute error smaller than the threshold.
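The compression walk described above can be sketched as follows. The input is the set of clusters above the threshold from the slide. The function name `compress` and the bookkeeping (most specific prefixes first, counting only the most general reported clusters inside each prefix) are assumptions of this sketch, which follows the rule stated in the talk rather than the paper's exact algorithm.

```python
import ipaddress


def compress(clusters, threshold):
    """Keep a prefix only if its traffic exceeds, by at least the threshold,
    the traffic already explained by more specific reported prefixes."""
    nets = sorted(
        ((ipaddress.ip_network(p), t) for p, t in clusters.items()),
        key=lambda item: item[0].prefixlen,
        reverse=True,                      # most specific prefixes first
    )
    report = []                            # (network, traffic) pairs kept so far
    for net, traffic in nets:
        inside = [(n, t) for n, t in report if n.subnet_of(net)]
        # Count only the most general reported prefixes inside `net`,
        # so that no traffic is counted twice.
        explained = sum(
            t for n, t in inside
            if not any(n != m and n.subnet_of(m) for m, _ in inside)
        )
        if traffic - explained >= threshold:
            report.append((net, traffic))
    return {str(n): t for n, t in report}


# Clusters with traffic above the threshold of 100, from the slide.
above_threshold = {
    "10.0.0.8/32": 160, "10.0.0.9/32": 110, "10.0.0.8/31": 270,
    "10.0.0.8/30": 305, "10.0.0.0/29": 120, "10.0.0.8/29": 380,
    "10.0.0.0/28": 500,
}
compressed = compress(above_threshold, threshold=100)
for prefix, traffic in compressed.items():
    print(prefix, traffic)
```

On this input the sketch keeps the same four clusters as the table on the slide: 10.0.0.8/31, 10.0.0.8/30 and 10.0.0.0/28 are dropped because the reported clusters beneath them already explain their traffic to within the threshold.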
Multidimensional structure example

- Nodes (clusters) have multiple parents
- Nodes (clusters) overlap

Figure: two hierarchies and the cluster space built from them. Source net: All traffic, divided into US (CA, NY) and EU (GB, DE). Application: Web, Mail. Example clusters highlighted: US, CA, Web, US Web.

While unidimensional reports are useful, often important aggregates can be revealed only through multidimensional analysis. Let's look now at the structure of a simple multidimensional cluster space. We have two small unidimensional hierarchies. Based on the source address we can divide the traffic into traffic that comes from the US network and traffic that comes from the European network. Furthermore we can divide the traffic from the US network into traffic from California and traffic from New York, and the European traffic into traffic from Great Britain and traffic from Germany. The other classification is based on port numbers and divides the traffic into web and email. We should think of these hierarchies as orthogonal, because they are. We combine them so that there is a single node representing the total traffic now, but this is still not the space of multidimensional clusters. To get there we should add the web traffic of each of the networks and their email traffic too. This semi-lattice is the structure of this simple multidimensional cluster space.

We can immediately notice some important properties of this space that make multidimensional algorithms harder. First of all, it is much bigger than the unidimensional hierarchies it is based on, making thresholding and compression all the more important. Also, nodes have multiple parents. For example, the node denoting the US web traffic has one parent along each dimension: the total US traffic and the total web traffic. Also, clusters can overlap without one including the other, which could not happen in the hierarchy trees. The California traffic and the US web traffic overlap this way.

I encourage you to read the paper if you want to find out about algorithms that find the clusters above the threshold and perform compression on them, but now I'll move on to the next section of the talk.
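The small cluster space from this slide can be enumerated directly, which makes the "multiple parents" and "overlap" properties easy to check. The sketch below is illustrative only: the dictionaries encode the two hierarchies from the slide, "*" stands for "all traffic" in a dimension, and the helper names are made up.

```python
from itertools import product

# Each value maps to its parent; "*" is the root ("don't care") of a dimension.
SOURCE_PARENT = {"US": "*", "EU": "*", "CA": "US", "NY": "US", "GB": "EU", "DE": "EU"}
APP_PARENT = {"Web": "*", "Mail": "*"}
HIERARCHIES = (SOURCE_PARENT, APP_PARENT)


def ancestors_or_self(value, parent_of):
    """The chain from a value up to the root of its hierarchy."""
    chain = [value]
    while chain[-1] != "*":
        chain.append(parent_of[chain[-1]])
    return chain


def parents(cluster):
    """Generalize one dimension by one step; one parent per non-root field."""
    result = set()
    for i, parent_of in enumerate(HIERARCHIES):
        if cluster[i] != "*":
            up = list(cluster)
            up[i] = parent_of[cluster[i]]
            result.add(tuple(up))
    return result


def includes(a, b):
    """True if cluster a contains cluster b in every dimension."""
    return all(a[i] in ancestors_or_self(b[i], h) for i, h in enumerate(HIERARCHIES))


def overlaps(a, b):
    """True if the two clusters can share traffic: in every dimension one
    value is an ancestor of (or equal to) the other."""
    return all(
        a[i] in ancestors_or_self(b[i], h) or b[i] in ancestors_or_self(a[i], h)
        for i, h in enumerate(HIERARCHIES)
    )


# Every combination of a source value and an application value is a cluster.
clusters = list(product(["*"] + list(SOURCE_PARENT), ["*"] + list(APP_PARENT)))
us_web, california = ("US", "Web"), ("CA", "*")
print(len(clusters), parents(us_web), overlaps(california, us_web))
```

Even with 7 source values and 3 application values the space already has 21 clusters, and the California and US web clusters overlap although neither includes the other.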
Contributions of this paper: Approach, Definitions, Algorithms, System, Experience

The next section is about our prototype system that does traffic analysis using multidimensional clusters.
System: AutoFocus

Diagram components: packet header trace, traffic parser, cluster miner, grapher, web based GUI. Operator inputs: names, traffic categories.

Our prototype system is called AutoFocus. Its core is the cluster miner that computes the clusters above the threshold and applies compression and unexpectedness labels to the traffic reports. The web based graphical user interface displays them together with the time series plots of the traffic generated by the third party grapher. The cluster miner gets its input from the traffic parser, which reads in packet header traces. The system can operate offline or semi real-time.

The system is fully automatic except for two enhancements based on operator input. We can add names to the system so that it knows that protocol 17 is UDP, port 80 is web and, say, prefix 10.0.64.0/19 is the library. The other enhancement is to break down the traffic into disjoint categories, each defined as one or more clusters. For example we could have separate categories for the web traffic going to the library, the kazaa traffic going to the dorms and one for the ICMP traffic.

Next I’m going to show you some traffic reports generated by AutoFocus for a trace from a network access point called SD-NAP located close to UCSD.
This report is for the week from the 15th to the 21st of December 2002. We have separate reports for each day and we can also drill down to three hour intervals and half hour intervals. This particular report measures the traffic in bytes, but AutoFocus also generates reports that measure the traffic in packets or in flows. The threshold is set to 5%, which for this particular week comes to slightly above 70 gigabytes.

The actual traffic report starts with five unidimensional reports, one for each field. Let me bring up the source IP report here. We can see how AutoFocus automatically finds the right granularity to describe the traffic at. For example this IP address made it into the traffic report on its own, while here we have a whole /19. These clusters are the right granularity to describe the traffic because they are both slightly above 5% of the traffic. We can also see how compression works here. The /16 made it into the report because its traffic, unlike that of the /17 and /18, is larger than that of the /19 by more than the threshold.
The unidimensional reports are followed by the multidimensional report which describes in one page all traffic aggregates whose volume is above 5% of the total traffic with an absolute error that never exceeds 70 gigabytes. Let me point out that some clusters have really high unexpectedness values, up to 35,000%. I highlighted here the five clusters that have the highest unexpectedness labels and we'll see soon how informative they are.
The final component of the GUI is the time series plot of the traffic. This shows clearly when the traffic spikes are but gives absolutely no insight into what causes those spikes. To combine multidimensional traffic cluster analysis with these easy to read time series plots, we divide the traffic into categories that we plot with different colors. If we just highlight the five clusters with the highest unexpectedness from the multidimensional report we get a much clearer picture of who causes those spikes in our network traffic.

Using the multidimensional reports and his knowledge of the network, the administrator can easily break down the regular traffic mix into meaningful categories. Once this is done, the time series plot gives a very easy to read summary of the traffic mix. There is a special category called “other traffic” that contains all traffic that is in none of the high volume meaningful categories that make up the regular traffic mix. Unusual events that generate a lot of traffic will show up as a bulge in this last category. AutoFocus provides reports describing the traffic mixes within each category and this helps pinpoint the causes of the unusual events.
Contributions of this paper: Approach, Definitions, Algorithms, System, Experience

Finally let me illustrate with two slides how AutoFocus helped us understand both the regular traffic mix and unexpected events.
Structure of regular traffic mix

Plot annotations (SD-NAP trace): Backups from CAIDA to tape server; Semi-regular time pattern; FTP from SLAC Stanford; Scripps web traffic; Web & Squid servers; Large ssh traffic; Steady ICMP probing from CAIDA

The most conspicuous category of traffic we identified were these massive backups. Multidimensional traffic reports identified the source and destination of these backups and also the port numbers. The red backups use ssh (port 22) and the brown ones use high port to high port transfers. Not only do we see that the backups are massive, but we also see their time patterns: regular spikes at midnight, and occasional transfers that last longer, like on Sunday and Monday.

We also found other categories you can read about in the paper; let me just present a last one here. We found this constant ICMP probing from CAIDA by looking at the report that measures traffic volume in flows, and it was the cluster with the highest unexpectedness. Note how the massive backups don't show up at all on this plot because even though they send many bytes, they do so using very few TCP connections. We'd rather expect scans and attacks using random source IP addresses to show up here. So what is this steady ICMP stream coming out of a respectable network? It is the traffic generated by the skitter project that uses active probing to map the Internet.
Analysis of unusual events

Plot annotations: UCSD to UCLA route change; Sapphire/SQL Slammer worm (Site 2)

One nice side effect of categorizing the regular traffic mix is that unusual traffic becomes very conspicuous. For example here the outbreak of the Sapphire worm shows up as a huge bulge in the “other traffic” category in our analysis of this trace from another site. How good is AutoFocus at pointing out the culprit? This is the undoctored multidimensional report generated automatically by our tool. It is rather obvious that there's a lot of UDP traffic going to port 1434. The report actually gives the volume of the worm traffic too. Note that the cluster in this line gives you exactly the ACL rule you should put into your router to filter the worm traffic out. These traffic clusters with high unexpectedness also give us the exact IP addresses within our network that generate worm traffic. So not only does the automatic report give the network administrator the ACL rule for a completely new worm, but it also tells him what computers are infected on his own network. Impressive!
Conclusions

Figure: a page of raw binary data, standing for the raw network measurements.

So what did we do in this paper? We showed you how to turn a big pile of raw network measurement data into insightful traffic reports.
Conclusions

- Multidimensional traffic clusters using natural hierarchies describe traffic aggregates
- Traffic reports using thresholding automatically identify conspicuous resource consumption at the right granularity
- Compression produces compact traffic reports and unexpectedness labels highlight non-obvious aggregates
- Our prototype system, AutoFocus, provides insights into the structure of regular traffic and unexpected events

Or if we want to be a bit more accurate we can say that multidimensional traffic clusters based on natural hierarchies describe many meaningful traffic aggregates that can be included in traffic reports. Using thresholds in the traffic reports automatically identifies conspicuous resource consumption at the right granularity. Compression significantly cuts down the size of the traffic reports without introducing serious errors. Unexpectedness labels highlight many important non-obvious aggregates. We demonstrated these points with our AutoFocus prototype that gave us valuable insights into the structure of regular traffic and that of unexpected events.
Thank you!

Alpha version of AutoFocus downloadable from http://ial.ucsd.edu/AutoFocus/

Any questions?

Acknowledgements: NIST, NSF, Vern Paxson, David Moore, Liliana Estan, Jennifer Rexford, Alex Snoeren, Geoff Voelker
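Bounds and running times

Columns on the slide: report size, running time, memory usage.
Uncompressed unidimensional report: size ≤ 1+(d-1)T/H, running time O(n+m(d-1)), memory O(m(d-1))
Unidimensional report: size ≤ T/H, linear
Unidimensional Δ report: size ≤ T_1/H + T_2/H
Uncompressed multidimensional report: size ≤ T/H ∏ d_i, running time ≈ result*n, memory O(m+result)
Multidimensional report: size ≤ T/H ∏ d_i / max(d_i)
Multidimensional Δ report: ≈ e^result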
Open questions

- Are there tighter bounds for the size of the reports?
- Are there algorithms that produce smaller results?
- Are there algorithms that compute traffic reports more efficiently? In streaming fashion?
Delta reports

- Why repeat the same traffic report if the traffic doesn’t change from one day to the other?
- Delta reports describe the clusters that increased or decreased by more than the threshold from one interval to the other
- On related traffic mixes, delta reports are much smaller than traffic reports
- Multidimensional compression is very hard for delta reports
- We have only an exponential algorithm for the cluster delta
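The basic idea of a delta report, before any compression, can be sketched as a comparison of two sets of per-cluster traffic volumes. Everything in this example is hypothetical: the function name `delta_report`, the two intervals and their numbers are made up for illustration, and the hard part mentioned on the slide (compressing the delta report) is not attempted.

```python
def delta_report(before, after, threshold):
    """Return the clusters whose traffic changed by more than the threshold
    between two intervals, with the signed change for each."""
    changes = {}
    for cluster in set(before) | set(after):
        delta = after.get(cluster, 0) - before.get(cluster, 0)
        if abs(delta) > threshold:
            changes[cluster] = delta
    return changes


# Hypothetical per-cluster traffic for two consecutive intervals.
day1 = {"10.0.0.0/29": 120, "10.0.0.8/29": 380, "10.0.0.8/32": 160}
day2 = {"10.0.0.0/29": 130, "10.0.0.8/29": 150, "10.0.0.12/30": 240}

changes = delta_report(day1, day2, threshold=100)
for cluster in sorted(changes):
    print(cluster, changes[cluster])
```

Clusters that barely move, like 10.0.0.0/29 here, drop out of the delta report, which is why such reports tend to be much smaller than full traffic reports when the two traffic mixes are related.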
Greedy compression algorithm
Multidimensional report example: Thresholding, Compression
System details

System structure
Part      Language            LoC     Status
Backend   C++                 5,400   stable
GUI       HTML, JavaScript    1,000   functional
Glue      perl                350     evolving

Operation
- Offline or semi real-time
- Input: packet header traces (NetFlow data soon)
- Output: traffic reports for multiple timescales (daily, etc.)
Automatic except enhancements based on operator input
- Traffic categories based on clusters
- Human readable names (e.g. 132.239.51.0/24 is dangernet)