Network Topologies CIS 700/005 – Lecture 3
Includes material from Brighten Godfrey
Last Time… K-ary fat tree: three-layer topology (edge, aggregation, and core)
each pod consists of (k/2)^2 servers & 2 layers of k/2 k-port switches
each edge switch connects to k/2 servers & k/2 aggr. switches
each aggr. switch connects to k/2 edge & k/2 core switches
(k/2)^2 core switches: each connects to k pods
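To make the recap concrete, here is a small Python sketch (mine, not from the lecture) that turns these relationships into component counts for a given port count k:

```python
def fat_tree_counts(k):
    """Component counts for a three-layer k-ary fat tree built from k-port switches."""
    assert k % 2 == 0, "k must be even so each switch can split its ports up/down"
    pods = k                          # each core switch has one port per pod
    servers_per_pod = (k // 2) ** 2   # (k/2)^2 servers per pod
    edge_per_pod = k // 2             # k/2 edge switches per pod
    aggr_per_pod = k // 2             # k/2 aggregation switches per pod
    core_switches = (k // 2) ** 2     # (k/2)^2 core switches
    return {
        "servers": pods * servers_per_pod,       # k^3 / 4
        "edge_switches": pods * edge_per_pod,    # k^2 / 2
        "aggr_switches": pods * aggr_per_pod,    # k^2 / 2
        "core_switches": core_switches,          # k^2 / 4
    }

# Example with commodity 48-port switches:
print(fat_tree_counts(48))
# {'servers': 27648, 'edge_switches': 1152, 'aggr_switches': 1152, 'core_switches': 576}
```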
This Time… Clos networks were designed for telephones. There are a bunch of other things you might want.
Example 1: F10 – fast failover and graceful capacity utilization, while keeping the other properties of Clos networks
Example 2: Jellyfish – incremental deployment and cheaper networks; throw out Clos networks and wire randomly
F10: A Fault-Tolerant Engineered Network
Vincent Liu, Daniel Halperin, Arvind Krishnamurthy, Thomas Anderson
Failure Handling Today
And how that works today is pretty straightforward. Every switch in this network constantly sends heartbeats to indicate that it's still working. These could be sent either to neighboring switches or to a centralized controller, but either way, as soon as a device fails, it stops sending heartbeats. When the heartbeats stop, we assume that the switch has failed and notify the other switches in the network, which can then stop using the failed device. This whole recovery process can take tens of milliseconds to sometimes even seconds, which can really impact performance. What's more: even after we've recovered connectivity, the network itself is left in an unbalanced state that destroys all of the nice load-balancing properties of Clos networks. I'll go over both of those issues later in the talk, but suffice it to say that in a modern data center, when a failure occurs, it causes problems.

The other part of this is that, in the old days, the assumption was that failures were relatively rare, so this wasn't a huge deal. But it turns out that failures are happening all the time. As I mentioned before, these networks typically have thousands of switches in them, so even though each individual switch might have 99.9% uptime, the probability that something in the network is failing is very high. And so we're constantly needing to recover from failures and deal with a degraded network.

Recovery is slow, and the network is left unbalanced. Failures are happening all the time!
Routing In Clos Networks
And, to answer that, we first need to go into a little more detail on how things are done today. This diagram shows a Clos network similar to the one we were looking at before, except that here we have 4 levels of switches, each of them with 4 ports, 2 going up and 2 going down. In a Clos network, when a source like this one on the right wants to send a packet to a destination like this one on the left, it does so by sending the packet upward until it reaches a switch that has a direct path down to the destination. In this case, the first switches that have a direct path down are the switches at the very top. Note that there are actually many such paths. Just at the first hop, the source has two choices, each of which can reach the destination. The way you typically choose between these paths is a protocol called Equal-Cost MultiPath, or ECMP. What ECMP does is pretty simple: when a new connection comes in to a given switch, the switch makes a random, local decision about which output port to use. In this example, that means that the source switch will give 50% of connections to its left parent and 50% to its right parent. Both second hops will do the exact same thing, placing 50% of connections on the left and 50% on the right, and so on and so forth until we get to the top, where each switch has one and only one path down to the destination. Without failures, this strategy actually ends up working pretty well.

Many possible paths; choose among them using Equal-Cost MultiPath (ECMP).
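As an illustration of the per-connection choice ECMP makes, here is a minimal Python sketch (my own illustration; real switches do this in hardware with their own hash functions). It hashes a flow's 5-tuple to pick one of a switch's upward ports, so every packet of a connection follows the same next hop while different connections spread roughly evenly:

```python
import hashlib

def ecmp_next_hop(flow_5tuple, upward_ports):
    """Pick an upward port for a flow by hashing its 5-tuple.

    Illustrative sketch of the idea: a local, deterministic-per-flow,
    roughly uniform choice among the equal-cost upward ports.
    """
    key = "|".join(str(field) for field in flow_5tuple).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return upward_ports[digest % len(upward_ports)]

# A switch with two upward ports sends roughly 50% of connections to each parent.
# (Addresses and ports below are made up for the example.)
flow = ("10.0.1.2", "10.3.0.7", 41532, 80, "tcp")
print(ecmp_next_hop(flow, ["parent_left", "parent_right"]))
```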
Routing With (Upward) Failures
But let's actually look at what happens in the case of a failure. There are two cases we'll have to consider: a failure on the upward half of the path and a failure on the downward half of the path. In reality, every failure is going to be on the upward half of *some* paths and the downward half of others, but conceptually, it's easier if we separate those two cases. But okay, let's say we have a failure here at the red X, on the upward half of this path. ECMP actually does alright in this case. The switch prior to the failure can immediately and locally restore connectivity at the point of the failure. It can do so because it's both in the perfect position to detect the failure and in the perfect position to route around it. So okay, that seems pretty positive, but now let's go back...

Lots of redundancy on the upward path; can immediately restore connectivity at the point of failure.
Routing With (Downward) Failures
...and take a look at what happens when we have a failure on the downward half of the path. We can't quite take the same approach here, since the node immediately prior to the failure has no direct, alternative paths and therefore can't immediately or locally recover from the failure. In fact, if we trace this path backwards, we see that most of the switches on the path do not have a direct path around the failure – only the source switch has an alternative. And actually, if we look across all paths that cross the failure on their way toward the destination, we see that, in this example, the switches that have alternatives, the switches that need to be involved in recovery, are all on the other side of the network.

No redundancy on the way down; alternatives are many hops away.
Like upward recovery, we want alternate paths closer to the failure
Routing With (Downward) Failures
And so the key challenge in making failure recovery fast is that we need to make downward failure recovery as fast as upward failure recovery. Specifically, we'd like to have alternate paths closer to the failure. Achieving that would enable any detecting switch to just shift over to a new path, immediately and locally. Now… how do we do that? The thing we realized in F10 was that the underlying problem is actually the symmetry of this topology. A little bit of symmetry can be a good thing: it makes routing and load balancing much easier. But in this case, it's a little too symmetrical. Each node is connected to the exact same set of parents as its siblings, the exact same set of grandparents as its cousins, and so on. That's why, as soon as we get too far into this failure's family tree, no matter which upward choice I make, there is no way around the failure. What we propose is to break this symmetry in a controlled way.
Type A Subtree – Consecutive Parents (figure: switches labeled 1, 2, 3, 4, x, y)
To this end, we introduce a Type A subtree that is connected in the same way as a classic fat tree. In this type of tree, nodes are connected to consecutive parents…
Type B Subtree – Strided Parents (figure: switches labeled 1, 2, 3, 4, x, y)
We also introduce a type B subtree that's connected in a strided fashion. Here, switch x is connected to… It's important to note that if you build a tree out of only type A subtrees, it's clearly just a traditional FatTree. But if you build a tree out of only type B subtrees, it's also identical to a traditional FatTree – it's just drawn differently. The difference comes…
AB Clos Networks …when you mix them to form an AB Clos network, where half of the subtrees are of type A and half are of type B. This topology retains all of the benefits of a traditional Clos network. The principal difference is how it deals with failures.
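As a rough sketch of the two wiring patterns (my own indexing, not necessarily F10's exact construction), the snippet below connects each child switch to either a consecutive block of parents (type A) or a strided set of parents (type B). Note how a child in a type A subtree and the same-position child in a type B subtree end up with different parent sets, which is what creates the alternate paths discussed next:

```python
def consecutive_parents(child_index, up_ports, num_parents):
    """Type A wiring: child i connects to a consecutive block of parents."""
    start = child_index * up_ports
    return [(start + j) % num_parents for j in range(up_ports)]

def strided_parents(child_index, up_ports, num_parents):
    """Type B wiring: child i connects to parents spaced num_parents/up_ports apart."""
    stride = num_parents // up_ports
    return [(child_index + j * stride) % num_parents for j in range(up_ports)]

# Tiny example in the spirit of the figures: 4 parents, children x (index 0)
# and y (index 1), each with 2 upward ports.
print(consecutive_parents(0, 2, 4))  # type A child x -> parents [0, 1]
print(consecutive_parents(1, 2, 4))  # type A child y -> parents [2, 3]
print(strided_parents(0, 2, 4))      # type B child x -> parents [0, 2]
print(strided_parents(1, 2, 4))      # type B child y -> parents [1, 3]

# A type A child and the corresponding type B child no longer share the same
# parent set, so when a parent (or its downward link) fails, a sibling subtree
# of the opposite type still has a working parent to route through.
```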
Alternatives in AB Clos Networks
If we go back to this example, where we've got a failure on the downward half of the path, we now have more alternative paths, closer to the point of the failure. In fact, instead of needing to inform switches all the way across the network, this topology guarantees that every switch is at most one hop away from a switch with an alternate path. For example, consider the left parent of the failure. All it needs to do is reroute to a sibling in an opposite-typed subtree. That sibling is guaranteed to have an alternative because it's no longer connected to the same set of parents as the failure. This technique also generalizes to handle multiple failures. And actually, for a topology where that parent has p ports going down, we can handle up to p/2+1 failures, which in a typical data center would mean somewhere around 25 targeted failures.

Guaranteed to be one hop away from an alternate path; just route to an opposite-typed subtree.
Cascaded Failover Protocols
A local rerouting mechanism – immediate restoration (μs)
A pushback notification scheme – restores direct paths (ms)
An epoch-based centralized scheduler – globally re-optimizes traffic (s)
Jellyfish: Networking Data Centers Randomly
Ankit Singla, Chi-Yao Hong, Lucian Popa, P. Brighten Godfrey
Structure constrains expansion
Coarse design points:
Hypercube: 2^k switches
de Bruijn-like: 3^k switches
3-level fat tree: 5k^2/4 switches
Fat trees by the numbers (3-level, with commodity 24, 32, 48, ... port switches): 3456 servers, 8192 servers, 27648 servers, …
Unclear how to maintain structure incrementally:
Overutilize switches? Uneven / constrained bandwidth
Leave ports free for later? Wasted investment
Jellyfish's approach: forget about structure – let's have no structure at all!
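As a rough sketch of what "no structure at all" means in practice (my own illustration, simplified from the paper's construction), the snippet below wires each switch's inter-switch ports to random peers, retrying when it picks a pair that is already connected:

```python
import random

def random_switch_graph(num_switches, net_ports_per_switch, seed=0):
    """Wire the inter-switch ports of each switch into a random graph.

    Minimal sketch: repeatedly pick two switches that both have a free port
    and are not already linked, and connect them. (Jellyfish additionally
    repairs dead ends with link swaps; here we simply stop after too many
    failed tries.)
    """
    rng = random.Random(seed)
    free = {s: net_ports_per_switch for s in range(num_switches)}
    links = set()
    failed_tries = 0
    while failed_tries < 1000:
        candidates = [s for s, ports in free.items() if ports > 0]
        if len(candidates) < 2:
            break
        a, b = rng.sample(candidates, 2)
        if (a, b) in links or (b, a) in links:
            failed_tries += 1
            continue
        links.add((a, b))
        free[a] -= 1
        free[b] -= 1
        failed_tries = 0
    return links

# Example: 20 switches, each devoting 4 ports to other switches
# (the remaining ports on each switch would face servers).
topology = random_switch_graph(20, 4)
print(len(topology), "links")  # close to 20 * 4 / 2 = 40
```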