Comparative Analysis of Internet Topology Data Sets
Jay Thom
Outline
- Introduction
- Problem Statement
- Methodology
- Conclusion
Introduction
- What is Internet Topology?
- Why measure the Internet?
- How is this done?
Topology Data Sets
- CAIDA Archipelago (Ark)
- Measurement Lab (M-Lab)
- RIPE NCC Atlas
- University of Washington iPlane
- ISI ANT Census
- Internet Research Lab (IRL)
- CIDR
The Problem…
- Big problem: What does the Internet look like right now?
- Smaller problem: Acquire data to infer this topology
- Collect data
- Recurring collection
- Python vs. C/C++
- Parse data
- Collect statistical information
- Make comparisons
Data Collection
Data stored in numerous formats…
- RIPE: .json files at anchors
- Ark: .warts (scamper), compressed binary files
- iPlane: compressed binary files, iPlane.c
- M-Lab: Google Cloud Storage, nested compressed files
- ANT Census: released every 2 months
- UCSD CAIDA (BGP data): compressed text files
- CIDR: compressed text files
- IRL (BGP data): compressed text files
Retrieve traceroute files as needed by date
Python vs. C/C++
Data Cleaning and Parsing
- Remove all unnecessary information
- Parse data into a common format
- Store in a consistent manner
- 30-day set vs. 5-day set
  - 30-day set = 1 TB
  - 5-day set = 181 GB
  - Reduce size to save time
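As an illustration of parsing into a common format, here is a minimal sketch that reduces one RIPE Atlas-style JSON traceroute result to a (source, destination, hops) tuple. The field names (`src_addr`, `dst_addr`, `result`, `from`, `x`) reflect my understanding of the Atlas result layout and should be treated as assumptions; the tuple layout and the function name `normalize_atlas_trace` are hypothetical, not the format used in this work.

```python
import json

def normalize_atlas_trace(record):
    """Reduce one Atlas-style traceroute result to (src, dst, hops)."""
    hops = []
    for hop in record.get("result", []):
        # Each hop carries several probe replies; keep the first responding
        # address, or "*" when no probe at this hop was answered.
        addr = "*"
        for reply in hop.get("result", []):
            if "from" in reply:
                addr = reply["from"]
                break
        hops.append(addr)
    return (record.get("src_addr"), record.get("dst_addr"), tuple(hops))

# Hand-written sample using documentation address ranges (not real data).
sample = json.loads("""
{"src_addr": "192.0.2.1", "dst_addr": "198.51.100.9",
 "result": [
   {"hop": 1, "result": [{"from": "192.0.2.254", "rtt": 0.5}]},
   {"hop": 2, "result": [{"x": "*"}, {"x": "*"}, {"x": "*"}]},
   {"hop": 3, "result": [{"from": "198.51.100.9", "rtt": 9.8}]}
 ]}
""")
common = normalize_atlas_trace(sample)
```

Dropping everything except the endpoints and the hop addresses is one way to cut the stored size; which fields are actually unnecessary depends on the statistics to be computed.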
Total Unique Source/Destination IP Addresses
- For each data source, how many unique source or destination IP addresses are found?
- This indicates the number of vantage points or targets the data source has access to.
- Question: does the number of vantage points/targets affect how much of the Internet a source can see?
- Question: what is the relationship between the number of vantage points/targets and the number of unique traces, unique IP addresses, and unique edges found?
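A minimal sketch of the count described here, assuming traces have already been parsed into hypothetical (source, destination, hops) tuples; the sample values are made up.

```python
def unique_endpoints(traces):
    """Return the sets of distinct source and destination addresses."""
    sources = {src for src, _dst, _hops in traces}
    destinations = {dst for _src, dst, _hops in traces}
    return sources, destinations

# Made-up traces in the assumed (src, dst, hops) layout.
traces = [
    ("10.0.0.1", "10.9.0.1", ("10.0.0.254", "10.5.0.1", "10.9.0.1")),
    ("10.0.0.1", "10.9.0.2", ("10.0.0.254", "10.5.0.1", "10.9.0.2")),
    ("10.0.0.2", "10.9.0.1", ("10.0.1.254", "10.5.0.1", "10.9.0.1")),
]
sources, destinations = unique_endpoints(traces)
print(len(sources), "vantage points;", len(destinations), "targets")
```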
Total Unique Traces
- How many unique traces is each data source able to find?
- Why would one source find more than another?
- What mechanisms are present that would affect these numbers?
Total Unique Edges Visited
Question: what does an edge represent?
- Connection between two routers
- An ingress/egress point between two ASes
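One way to count unique edges, sketched under the assumption that an edge is an ordered pair of consecutive hop addresses in a trace. Treating edges as directed and skipping self-pairs are choices made for this example only.

```python
def unique_edges(hop_lists):
    """Collect distinct (a, b) pairs of consecutive hops across all traces."""
    edges = set()
    for hops in hop_lists:
        for a, b in zip(hops, hops[1:]):
            if a != b:  # a repeated address is not a link between two routers
                edges.add((a, b))
    return edges

# Letters stand in for router addresses.
hop_lists = [("A", "B", "C"), ("A", "B", "D"), ("A", "B", "C")]
edges = unique_edges(hop_lists)
print(sorted(edges))
```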
IP/Trace Counts
- Number of unique IP addresses vs. all collected IPs
- Number of unique traces vs. all collected traces
Question: Why is this important?
- How many times is a data source repeating the same measurements?
- How many duplicated efforts are seen? Why would this be?
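The unique-versus-collected comparison can be sketched as a simple duplication ratio. The function name `redundancy` and the definition of the ratio (share of collected items that repeat an earlier one) are assumptions for illustration.

```python
from collections import Counter

def redundancy(items):
    """Return (unique, total, duplicate_fraction) for a list of hashable items."""
    counts = Counter(items)
    total = sum(counts.values())
    unique = len(counts)
    duplicate_fraction = 1 - unique / total if total else 0.0
    return unique, total, duplicate_fraction

# Four collected traces, two of which repeat the first one.
collected = [("A", "B", "C"), ("A", "B", "C"), ("A", "B", "D"), ("A", "B", "C")]
unique, total, dup = redundancy(collected)
print(f"{unique} unique of {total} collected ({dup:.0%} duplicated)")
```

The same function works for IP addresses if the list holds addresses instead of hop tuples.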
Problem - Unresponsive Routers
- Count as an edge?
- Keep, or disregard?
- If kept, how should they be noted?
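The two options can be compared side by side. This sketch assumes unresponsive hops are recorded as "*". The "keep" labelling scheme (trace id plus hop index, so separate anonymous hops are never merged into one node) is one hypothetical convention, not a decision taken in this work.

```python
def edges_with_policy(trace_id, hops, policy="drop"):
    """Build consecutive-hop edges, either dropping or labelling '*' hops."""
    if policy not in ("drop", "keep"):
        raise ValueError("policy must be 'drop' or 'keep'")
    if policy == "keep":
        # Give each anonymous hop a label unique to this trace and position.
        hops = [f"*{trace_id}:{i}" if h == "*" else h for i, h in enumerate(hops)]
    pairs = list(zip(hops, hops[1:]))
    if policy == "drop":
        # Discard every edge that touches an unresponsive hop.
        pairs = [(a, b) for a, b in pairs if "*" not in (a, b)]
    return pairs

hops = ["A", "*", "C", "D"]
dropped = edges_with_policy("t1", hops, policy="drop")
kept = edges_with_policy("t1", hops, policy="keep")
```

Dropping loses the fact that A and C are two hops apart; keeping inflates the node count with placeholders that cannot be matched across traces.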
Distribution per Source/Destination IP
- Find the distribution of our data points per source and per destination
- Analyze this to understand the effectiveness of each platform's approach to measurement
- Data points: IPs, traces, edges
- Grouped by: sources, destinations
Firewalls, Loops, Repeated IP Addresses
- A-B-C-C
- A-B-C-D-C
- A-B-C-C-D
- A-B-B-C
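The four patterns above can be flagged mechanically. In this sketch the labels `adjacent_repeat` and `loop` are my own names: an address that appears twice in a row gets the first, an address that reappears after a different hop gets the second.

```python
def classify_path(hops):
    """Return the set of anomaly flags found in one hop sequence."""
    flags = set()
    last_seen = {}  # address -> index of its most recent occurrence
    for i, addr in enumerate(hops):
        if addr == "*":
            continue  # unresponsive hops are not compared
        if addr in last_seen:
            if last_seen[addr] == i - 1:
                flags.add("adjacent_repeat")
            else:
                flags.add("loop")
        last_seen[addr] = i
    return flags

patterns = ["A-B-C-C", "A-B-C-D-C", "A-B-C-C-D", "A-B-B-C", "A-B-C-D"]
results = {p: classify_path(p.split("-")) for p in patterns}
for p, flags in results.items():
    print(p, sorted(flags) or "clean")
```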
RIPE Atlas Hardware
- Rack-mounted anchor
- Small probe (connected anywhere)
RIPE Atlas: User-Defined Measurements
IPs in Traces Not Seen in ANT Census
Question: will some IP addresses be discovered in traces that were not found in the ANT census?
- Some addresses send ICMP Time Exceeded replies but do not answer ICMP Echo Requests
- Newly discovered IP addresses can then be used as active target IP addresses for future probes
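The comparison amounts to a set difference. This sketch assumes the census has been reduced to a collection of responsive addresses and that unresponsive hops are recorded as "*"; both are assumptions about the parsed data, and the addresses are made up.

```python
def new_targets(hop_lists, census_responsive):
    """Addresses seen as trace hops that the census did not list as responsive."""
    seen_in_traces = {h for hops in hop_lists for h in hops if h != "*"}
    return seen_in_traces - set(census_responsive)

hop_lists = [
    ("10.0.0.254", "10.5.0.1", "*", "10.9.0.1"),
    ("10.0.0.254", "10.5.0.7", "10.9.0.2"),
]
census_responsive = {"10.0.0.254", "10.9.0.1", "10.9.0.2"}
candidates = new_targets(hop_lists, census_responsive)
print(sorted(candidates))
```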
Prefix Announcements vs. Mask Distribution
- Question: What is the distribution of subnets announced by each AS data source (CAIDA, IRL, CIDR)?
- Why do some perform better than others?
Conflicts in Subnet Announcements
173.246.82.76/30 173.246.82.76/29 173.246.82.76/29 173.246.82.76/28 173.246.82.76/28 173.246.82.76/28 173.246.82.76/28
- Determine total number of subnets announced
- Combine all smaller subnets to see if they make up a complete larger subnet
- Compare larger subnets to see if they are announced by more than one AS in a data set
- Analyze to determine if sources clean up conflicts
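The combine-and-compare steps can be sketched with Python's standard `ipaddress` module. The announcement layout (prefix text, origin ASN) and the AS numbers are made up for illustration. `strict=False` is used because a prefix such as 173.246.82.76/29 has host bits set and would otherwise be rejected.

```python
import ipaddress
from collections import defaultdict

def parse_prefix(text):
    """Parse a prefix, masking off host bits (173.246.82.76/29 -> 173.246.82.72/29)."""
    return ipaddress.ip_network(text, strict=False)

def merge_subnets(prefixes):
    """Combine adjacent or overlapping subnets into the fewest covering networks."""
    return list(ipaddress.collapse_addresses(prefixes))

def conflicting_origins(announcements):
    """Map each prefix announced by more than one AS to its set of origin ASNs."""
    origins = defaultdict(set)
    for text, asn in announcements:
        origins[parse_prefix(text)].add(asn)
    return {net: asns for net, asns in origins.items() if len(asns) > 1}

# Two adjacent /30s make up one complete /29.
merged = merge_subnets([parse_prefix("192.0.2.0/30"), parse_prefix("192.0.2.4/30")])

# Hypothetical announcements: the same /29 originated by two different ASes.
announcements = [
    ("173.246.82.76/30", 64500),
    ("173.246.82.76/29", 64500),
    ("173.246.82.76/29", 64501),
]
conflicts = conflicting_origins(announcements)
```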
Trace Data Coverage by BGP Data
- Not all IP addresses found in our trace data will be visible to our AS data sources
- BGP data comes from the RouteViews project, University of Oregon
- Addresses in some regions may not be seen, for example somewhere in Asia
- Track statistics on IP addresses not found by data sources
- Track AS coverage per data source
- Track total number of prefixes announced by data source
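A minimal sketch of the coverage check, assuming each BGP data source has been parsed into (prefix, origin ASN) pairs. It uses a linear longest-prefix match for clarity; a real run over millions of addresses would likely need a trie or similar index. Prefixes and AS numbers are made up.

```python
import ipaddress

def build_table(announcements):
    """Turn (prefix_text, asn) pairs into (network, asn) pairs."""
    return [(ipaddress.ip_network(p), asn) for p, asn in announcements]

def origin_as(ip_text, table):
    """Return the ASN of the longest matching prefix, or None if uncovered."""
    ip = ipaddress.ip_address(ip_text)
    best = None
    for net, asn in table:
        if ip.version == net.version and ip in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, asn)
    return best[1] if best else None

table = build_table([("198.51.100.0/24", 64500), ("198.51.100.128/25", 64501)])
trace_ips = ["198.51.100.5", "198.51.100.200", "203.0.113.7"]

mapping = {ip: origin_as(ip, table) for ip in trace_ips}
uncovered = [ip for ip, asn in mapping.items() if asn is None]
coverage = 1 - len(uncovered) / len(trace_ips)
print(f"coverage {coverage:.0%}, uncovered: {uncovered}")
```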
AS Rank by Origin, Destination, IP, Edge
- Rank ASes by the number of data points found in each, per data source
- Compare coverage of ASes by each trace data source
- Question: Why are some ASes more visible to sources than others?
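Ranking can be sketched with a counter, assuming each data point (origin, destination, IP, or edge endpoint) has already been mapped to an AS number, with None marking points that no prefix covered. The AS numbers are made up.

```python
from collections import Counter

def rank_ases(asn_per_point):
    """Return (asn, count) pairs, most frequently seen AS first."""
    return Counter(a for a in asn_per_point if a is not None).most_common()

# One AS number per observed data point.
asn_per_point = [64500, 64501, 64500, None, 64502, 64500, 64501]
ranking = rank_ases(asn_per_point)
for position, (asn, count) in enumerate(ranking, start=1):
    print(position, f"AS{asn}", count)
```

Running this once per data source and per data-point type gives rankings that can be compared across sources.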
AS Coverage by Origin, Destination, IP, Edge
- Track numbers of source IPs, destination IPs, total IPs, and total edges per AS by data source
- Rank sources based on these values (which source sees how many ASes per value)
- Create visual graphs of these ASes; collect and analyze graph data such as degree, centrality, etc. (use tool from CAIDA)
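As a rough illustration of the graph statistics mentioned, this sketch computes node degree and normalized degree centrality from a list of AS-level edges in plain Python. It is a stand-in for the idea only, not the CAIDA tool, and the edge list is made up.

```python
from collections import defaultdict

def as_degrees(as_edges):
    """Count distinct neighbors per AS in an undirected AS-level graph."""
    neighbors = defaultdict(set)
    for a, b in as_edges:
        if a != b:  # ignore edges that stay inside one AS
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {asn: len(peers) for asn, peers in neighbors.items()}

def degree_centrality(degrees):
    """Normalize each degree by the maximum possible (n - 1)."""
    n = len(degrees)
    if n < 2:
        return {asn: 0.0 for asn in degrees}
    return {asn: d / (n - 1) for asn, d in degrees.items()}

as_edges = [(64500, 64501), (64500, 64502), (64501, 64500), (64502, 64502)]
degrees = as_degrees(as_edges)
centrality = degree_centrality(degrees)
```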
Conclusion
- Problem Statement
- Methodology
  - Collection
  - Parsing
  - Statistics
  - Analysis
- Problems
Questions?
Thanks