“Uptime” at IXPs - and NIS Directive


“Uptime” at IXPs - and NIS Directive Robert Lister UKNOF 40 27 April 2018 | Manchester

NIS Directive EU Directive on security of Networks and Information Systems. UK consultation (August/September 2017): https://www.gov.uk/government/consultations/consultation-on-the-security-of-network-and-information-systems-directive https://www.ncsc.gov.uk/guidance/introduction-nis-directive

NIS Directive May require IXPs to report availability / outage metrics. For the UK, this means OFCOM: “Operators who have 50% or more annual market share amongst UK IXP Operators in terms of interconnected autonomous systems, Or: Who offer interconnectivity to 50% or more of Global Internet routes.” This is what the UK has settled on for defining the size of an IXP. At the time, though, other criteria were being talked about.

“High availability” (Source: https://en.wikipedia.org/wiki/High_availability)

Availability %                    | Downtime per year     | Downtime per month   | Downtime per week    | Downtime per day
90% ("one nine")                  | 36.5 days             | 72 hours             | 16.8 hours           | 2.4 hours
95% ("one and a half nines")      | 18.25 days            | 36 hours             | 8.4 hours            | 1.2 hours
97%                               | 10.96 days            | 21.6 hours           | 5.04 hours           | 43.2 minutes
98%                               | 7.30 days             | 14.4 hours           | 3.36 hours           | 28.8 minutes
99% ("two nines")                 | 3.65 days             | 7.20 hours           | 1.68 hours           | 14.4 minutes
99.5% ("two and a half nines")    | 1.83 days             | 3.60 hours           | 50.4 minutes         | 7.2 minutes
99.8%                             | 17.52 hours           | 86.23 minutes        | 20.16 minutes        | 2.88 minutes
99.9% ("three nines")             | 8.76 hours            | 43.8 minutes         | 10.1 minutes         | 1.44 minutes
99.95% ("three and a half nines") | 4.38 hours            | 21.56 minutes        | 5.04 minutes         | 43.2 seconds
99.99% ("four nines")             | 52.56 minutes         | 4.38 minutes         | 1.01 minutes         | 8.64 seconds
99.995% ("four and a half nines") | 26.28 minutes         | 2.16 minutes         | 30.24 seconds        | 4.32 seconds
99.999% ("five nines")            | 5.26 minutes          | 25.9 seconds         | 6.05 seconds         | 864.3 milliseconds
99.9999% ("six nines")            | 31.5 seconds          | 2.59 seconds         | 604.8 milliseconds   | 86.4 milliseconds
99.99999% ("seven nines")         | 3.15 seconds          | 262.97 milliseconds  | 60.48 milliseconds   | 8.64 milliseconds
99.999999% ("eight nines")        | 315.569 milliseconds  | 26.297 milliseconds  | 6.048 milliseconds   | 0.864 milliseconds
99.9999999% ("nine nines")        | 31.5569 milliseconds  | 2.6297 milliseconds  | 0.6048 milliseconds  | 0.0864 milliseconds

“LOL.” “OK.”
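The downtime budgets in the table follow directly from the availability percentage. A minimal sketch (the function name and period convention are illustrative, not from the talk; a "month" here is 1/12 of a 365-day year, matching the table's figures):

```python
# Allowed downtime for a given availability percentage, reproducing
# rows of the "nines" table above.

SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 365 * 24 * 3600 / 12,   # the table's month is 1/12 year
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
}

def downtime_seconds(availability_pct, period="year"):
    """Seconds of downtime allowed per period at the given availability."""
    return SECONDS[period] * (1 - availability_pct / 100)

# "three nines" allows about 8.76 hours per year:
print(round(downtime_seconds(99.9, "year") / 3600, 2))   # 8.76
# "five nines" allows about 5.26 minutes per year:
print(round(downtime_seconds(99.999, "year") / 60, 2))   # 5.26
```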

99.99(9)% uptime? Current network uptime: 99.999% * Could we produce an “uptime” metric for an IXP? Sometimes engineers get asked to generate an “uptime metric”. Wouldn’t it be good if we could have this? Would it?

99.99(9)% uptime? Current network uptime: 99.999% * We may be able to produce such a metric, but it would probably come with a bunch of small print… You would want us to explain how we did it. Would the method be reasonable enough to give a meaningful number? Or would it be so difficult to do that we’d have to make a lot of “exceptions” so the number always looks “good”? 9 out of 10 cats local-pref our prefixes. The value of your pings may go down as well as up. We reserve the right to replace lost packets with equivalent-size packets at our discretion. Not to scale. Not actual web site. Due to rounding, numbers presented may not add up precisely to the totals provided and percentages may not precisely reflect the absolute figures. Figures were correct at the time we made them up. Subject to National Rail Conditions of Travel. Packets valid via any reasonable route. Contents may settle during shipping.

Determine “up” at an IXP (Diagram: an IXP switch with member routers R1–R5 at 5.57.80.1, 5.57.80.2, … 5.57.80.xx, plus a monitoring server.) So what are the options for determining what is “up” at an IXP? On a simple exchange, the answer might be: just ping every member. Our monitoring server receives a response from every member, so we know all is good. We might even record the latency. = “100% up”

Ping all the things… (Table: per-member ping results and availability, e.g. 5.57.80.1 at 100%, 5.57.80.5 at 99.65%, with lots more columns.) Example: in 24 hours there are 1440 minutes; 5 minutes of downtime leaves 1435 minutes up (99.652%). It would more likely be calculated in seconds: (86400 − 300) / 86400 = 99.652%. So, we do this a few times, and we begin to build up a picture that we might use to calculate an uptime metric.
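The seconds-based calculation above fits in a one-liner; a minimal sketch (the function name is my own, not from the talk):

```python
# Uptime percentage over a window, as in the slide's example of a
# 24-hour day (86400 seconds) with 5 minutes (300 seconds) down.

def uptime_pct(window_seconds, down_seconds):
    return (window_seconds - down_seconds) / window_seconds * 100

day = 24 * 3600  # 86400 seconds
print(round(uptime_pct(day, 300), 2))  # 99.65 (99.6527…; the slide truncates)
```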

Pinging members can suck… Some members may have busy routers (high latency / packet loss). Some do not reply to ping at all. We might miss shorter outages between pings. Still, latency is an interesting stat to monitor. Ping is a bit sucky, but it does have its uses; latency can tell us about remotely connected members, for example.

It can get… messy (Diagram: ping results for members 5.57.80.1–5.57.80.5, a mix of successes and failures, collected via an IXP Manager option.)

Correlate pings with other pings! (Table: ping results for 5.57.80.12, 5.57.80.52, 5.57.80.48, 5.57.80.76, 5.57.80.91.) Pinging a single host is of limited use by itself; it is more useful if we correlate. Multiple members unreachable in the same interval may indicate an outage. Despite the limitations, we can correlate pings to make them more useful: for example, if a bunch of members become unreachable within the same time interval, that is more useful information.
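Correlation across members can be as simple as flagging intervals where several members fail at once. A sketch under assumed data shapes, with an illustrative threshold (nothing here is from the talk itself):

```python
# Flag intervals in which several members failed ping together:
# one member down is probably their problem; many at once may be ours.

def suspected_outages(results, threshold=3):
    """results maps interval -> {member_ip: ping_succeeded}.
    Returns intervals where at least `threshold` members failed."""
    flagged = {}
    for interval, pings in results.items():
        failed = sorted(ip for ip, ok in pings.items() if not ok)
        if len(failed) >= threshold:
            flagged[interval] = failed
    return flagged

results = {
    "12:00": {"5.57.80.12": True, "5.57.80.52": True, "5.57.80.48": True},
    "12:05": {"5.57.80.12": False, "5.57.80.52": False, "5.57.80.48": False},
}
print(suspected_outages(results))  # only the 12:05 interval is flagged
```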

Correlate other monitoring data (Table: per-member ping, BGP sessions to the route servers RS1/RS2, port status, ARP, traffic, errors, … e.g. 5.57.80.48 with 7/10 BGP sessions, 99% traffic, 5068 errors.)

# My clever alert correlation script 1.0
if ($port_down) {
  if (…) {
    …lots of twisty code
  }
  $uptime = do_magic()
  # 2002-08-10: should probably rewrite this bit sometime…
  # 2018-01-28: LOL!
}
@PORTS = get_snmp_voodoo()

Correlating with other monitoring gives us more insight. This is useful for monitoring, but it makes a “single metric” calculation complex. Is it both up and down? Wait a bit… We could correlate ping with other data we may have, such as port status, BGP sessions, etc. This enables us to determine why a member might not be responding to ping, or that they are up even though they can’t be pinged. All useful for monitoring, but complex to use for an uptime calculation: a lot of data has to be worked through, and, assuming we don’t lose any of it, we still have to make some assumptions about the member’s availability.

Path availability (Diagram: ten routers, R1–R10, in a full mesh.)

Path availability possible paths = n × (n − 1) / 2; for 10 routers, 10 × (10 − 1) / 2 = 45, and 45 paths available = 100%. We consider every path, whether or not peering actually exists; ASNs don’t peer with themselves. (Yes, this slide took forever to draw…)
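The path count above is just the number of unordered router pairs; as a quick check:

```python
# Possible paths between n connected routers: every unordered pair,
# whether or not the two ASNs actually peer.

def possible_paths(n):
    return n * (n - 1) // 2

print(possible_paths(10))  # 45, as on the slide
```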

Exchange topology switch1 switch2 switch3 switch4 So how do we go about collecting the data in an IXP?

Exchange topology switch1 switch2 switch3 switch4 First, we need a monitoring device in every site, or connected to every switch. Every monitoring device pings every other device.

Calculating path availability (Diagram: four switches with 5, 10, 2 and 5 connected ports.) Since we know how many connected ports are in each site, we can easily work out the path-availability impact of ports being unreachable.

Calculating path availability (Diagram: the site with 10 connected ports is unreachable.) In this example, we are unable to reach the monitoring station in a particular location. This affects 10 connections, which we can mark as down rather than waiting to try to ping them all.

Calculating path availability Connected ports: 22. Possible paths: 22 × (22 − 1) / 2 = 231. Down ports: 10. Paths reduced by: 10 × (22 − 1) / 2 = 105. Remaining: 231 − 105 = 126. Path availability: 126 / 231 = 54.55%.

Calculating path availability Here is what it might look like with LONAP’s port counts: Connected ports: 314. Possible paths: 49141. Down ports: 10. Paths reduced by: 1565. Remaining: 47576. Path availability: 96.82%. In fact, the possible-paths figure is maybe interesting but isn’t really needed; we get exactly the same result using just the port counts. We would possibly need to correlate this with other data to account for things like partial outages, but it would be more reliable than just pinging members alone.
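Both worked examples reduce to a short calculation, and, as the slide notes, the path count cancels out so only the port counts matter. A sketch (the function name is mine):

```python
# Path availability from port counts. The pair-count terms share a
# factor of (n - 1) / 2, so the result simplifies to (n - d) / n.

def path_availability(connected_ports, down_ports):
    possible = connected_ports * (connected_ports - 1) / 2
    reduced = down_ports * (connected_ports - 1) / 2
    return (possible - reduced) / possible * 100

print(round(path_availability(22, 10), 2))   # 54.55, the small example
print(round(path_availability(314, 10), 2))  # 96.82, the LONAP example
```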

..another way to do it – by port capacity? (Table: per-port speeds in Mbps, e.g. Port 1 = 100, Port 2 = 1000, Port 5 = 10000.) Connected capacity: 2,339,000 Mbps. Capacity down: 13,100 Mbps. Remaining availability: 99.44%. This might be a fairer way to calculate it: work out the capacity of the affected ports as a percentage of total connected capacity.

..another way to do it – by port capacity? (Table: a single 100,000 Mbps port down.) Connected capacity: 2,339,000 Mbps. Capacity down: 100,000 Mbps. Remaining availability: 95.72%. A single 100G port going down in a site would have more impact on the exchange than 24 1G ports going down.
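The capacity-weighted version is the same percentage with Mbps in place of port counts; a sketch reproducing both figures above:

```python
# Availability weighted by port capacity: the share of connected
# capacity (in Mbps) that is still up.

def capacity_availability(connected_mbps, down_mbps):
    return (connected_mbps - down_mbps) / connected_mbps * 100

print(round(capacity_availability(2_339_000, 13_100), 2))   # 99.44
print(round(capacity_availability(2_339_000, 100_000), 2))  # 95.72
```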

…or use the switches themselves? The exchange is no longer just a flat layer-2 network; the devices are layer 3, and every core link is an IP point-to-point link. We could monitor these to work out a “core availability” figure, and maybe take traffic impact into account (a down link may have no noticeable impact).

Is it a useful metric? Do we exclude things like maintenance? Exclude other factors “outside our control”? Is that realistic? Try not to obsess about the number! 100% 99.99% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%

What LONAP members said… “Your job is to move packets. Just monitor ingress and egress packets.” “Don’t spend a lot of effort creating this metric.” “Just focus on running a reliable service. Don’t break it.” “Use sFlow to detect problems (find increased TCP SYNs).” “Use whatever metric internally if it helps. Probably not useful to publish it.” “You need more pictures of cats.”

What other EURO-IX IXPs said… “We tried and gave up.” “It’s too complicated to create a reliable number.” “We do a complex calculation to create availability metrics.” Should we try to develop some standard metrics?

Thoughts?