1 CS294 Project Virtual and Redundant Switches IRAM Retreat – Winter 2001 Sam Williams
2 CS294 Project Outline: Motivation; Existing Products; Arrayed Commodity Switches; Adding Redundancy; Optimizing; Generalization; Conclusions
3 CS294 Project Motivation: The cost of switches grows very quickly: O(Ports^2) for crossbar-based designs. Additionally, address tables and buffers must grow. The industry-leading MTBF for a single switch is about 50K hours, and a typical figure is perhaps only 25K hours. Modular switches provide redundancy for management and power, but not for the data transport fabric. MTTR is typically over 1 hour. Can the money saved by cascading commodity switches be applied towards improved performance or redundancy? The goals are to improve the MTBF, improve performance, and simplify the work that must be done to replace a failed switch.
4 CS294 Project Existing Products: Existing modular aggregators can merge several smaller switches (modules) into a single large virtual switch. In this case, each 36-port switch module has a pair of gigabit uplinks to the switching fabric, which has either 6 or 24 gigabit ports (full duplex). Redundancy is provided for management modules, fans, and power supplies, but not for the switching modules or the switching fabric. So if the switching fabric fails, the entire device fails, but if an individual switching module fails, only that sub-network fails. Management modules can infer priority to improve performance for critical activity. [Figure: 3Com Switch 4007, logical view. Management module; 4 x 36-port switching modules, each with 2 gigabit uplinks; 120 Gbps backplane (16 used); switching fabric with 24 internal gigabit ports.]
5 CS294 Project Existing Products (Analysis): The cost analysis here is based on the use of either 18 or 48 Gbps switching fabrics, 36-port switching modules, and either a 7- or 13-bay chassis. Performance is measured as the slowdown in the time to send from every node to every other node, compared to a true n*36-port switch. MTBF is for any part of the network. MTTR was at least 1 hour. Repair cost is about $4000 per failure. Modularization helps to keep this low, but yearly maintenance cost will grow with the number of ports.
6 CS294 Project Examples of failure: If a switching module fails, each of the nodes or sub-networks attached to it is now disconnected from all other nodes (the more likely case). If the switching fabric fails, each of the switches is now disconnected from the others, but nodes attached to the same switch can still communicate with each other.
7 CS294 Project Examples of failure (continued): Redundancy allows for this failure, with reduced performance. These are not commodity switches, and they are considerably more expensive. However, in this case, the failure does cause a network split. This is the more likely case, so why not allow the extra switch to be used to cover any other switch’s failure? This could be extended to nodes, but then you pay double for NICs and ports.
8 CS294 Project Virtual switch from commodity switches: Cheaper virtual switches can be built, although without the management functions and the performance, by doing nothing more than cascading commodity switches. This analysis is based on 5-, 8-, 16-, and 24-port switches from 5 different companies, each with the last port of MDI type. Performance is poor since the uplinks are only 100 Mbps. Adding a second uplink port only moderately alleviates this deficiency.
9 CS294 Project Virtual switch from mid-range switches: By using switches more suited to this design (with higher-speed uplinks), we can improve performance. These designs use 8- or 24-port switches at the bottom, each with 1 or 2 gigabit uplink modules, and a 4-, 8-, or 12-port gigabit switch at the top. The gigabit uplinks and gigabit switches drive the cost to at least twice that of the commodity solution, but with 10x better performance. Performance is near that of a monolithic switch if 2 uplinks are used. Compared to the packaged solution, it is about half the cost with slightly less performance, but no management functionality.
10 CS294 Project Port Virtualization for Redundancy: The re-mapping stage is much simpler than a full n*m port switch. Essentially, each of the m n-bit buses is mapped to one of the k n-bit internal buses, which are connected directly to the switches. For this example, each of the 4 groups of 8 virtual ports is mapped to one of the 5 groups of physical ports. The uplinks of the first-stage switches are sent back and into one of the top-level switches. An even simpler solution, for single redundancy, would be to map either directly or to the spare. In this design the single point of failure is the re-mapping block, since the first- and second-level switches have redundancy. So for this example, MTBF is improved by about two-thirds (from 208 days to 347 days). [Figure: port re-mapping block, with extra switches for redundancy.]
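The 208-day baseline is consistent with a simple series model, sketched below. The model is an assumption, not something the slides state: five switches (four first-stage plus one top-level), each with the typical 25K-hour MTBF quoted in the Motivation slide, failing independently at a constant rate, where any single failure counts as a network failure. The 347-day figure is taken from the slide as given; the function names are hypothetical.

```python
HOURS_PER_DAY = 24

def series_mtbf_hours(unit_mtbf_hours, n_units):
    """MTBF of n identical units in series.

    Assumes independent, constant-rate (exponential) failures, so the
    failure rates simply add: rate_total = n / unit_mtbf.
    """
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return unit_mtbf_hours / n_units

def improvement_pct(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before

# Assumed baseline: 5 switches at 25K hours each (not stated on the slide).
baseline_days = series_mtbf_hours(25_000, 5) / HOURS_PER_DAY
print(round(baseline_days))               # 208

# Improvement implied by the two figures quoted on the slide.
print(round(improvement_pct(208, 347)))   # 67
```

Under these assumptions the quoted improvement works out to roughly 67%.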
11 CS294 Project Operation (Homogeneous switches): In this somewhat rigid example there are 6 bays: 4 of which map either directly or to the spare, plus a switching fabric slot and a slot for the redundant switch, which can replace a switch in either of the other two classes. In this case, the switching fabric switch failed, and the uplink ports were remapped to the spare. At this point the admin must replace the failed switch. If any other switch fails before this, the network will be partially split.
12 CS294 Project Operation - continued: In this case, one of the first-level switches failed. Instead of those nodes losing their connection to the rest of the network, they are remapped to the spare. Once again, the admin must replace the failed switch. If any other switch fails before this, the network will be partially split. If instead the spare had gone down, it would need to be replaced to restore redundancy.
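The failover behaviour described in the two Operation slides can be sketched as a small bookkeeping model. This is only an illustration under assumptions: the class name `SpareMapper` and its methods are hypothetical, each "role" stands for a group of virtual ports (or the fabric slot) served by one physical switch, and the real re-mapping block would be hardware rather than software.

```python
class SpareMapper:
    """Hypothetical model: roles are mapped onto physical switches,
    with extra switches held back as spares."""

    def __init__(self, n_roles, n_spares):
        # Role r starts on physical switch r; the remaining switches are spares.
        self.mapping = {r: r for r in range(n_roles)}
        self.spares = list(range(n_roles, n_roles + n_spares))
        self.failed = set()

    def fail(self, switch):
        """Record a failure. Returns True if the network stays whole,
        False if it is now partially split."""
        self.failed.add(switch)
        if switch in self.spares:
            # An idle spare died: no remap needed, but redundancy is reduced.
            self.spares.remove(switch)
            return True
        for role, phys in self.mapping.items():
            if phys == switch:
                if not self.spares:
                    return False          # nothing left to remap to
                self.mapping[role] = self.spares.pop(0)
                return True
        return True                       # switch was not carrying any role

    def replace(self, switch):
        """The admin swaps in a new unit, which joins the spare pool."""
        if switch in self.failed:
            self.failed.discard(switch)
            self.spares.append(switch)

# Six bays: roles 0-3 are first-level switches, role 4 is the fabric,
# and physical switch 5 is the single spare.
mapper = SpareMapper(n_roles=5, n_spares=1)
first = mapper.fail(4)    # fabric fails and is remapped to the spare
second = mapper.fail(1)   # a second failure before repair splits the network
print(first, mapper.mapping[4], second)   # True 5 False
```

Once the failed fabric switch is replaced (`mapper.replace(4)`), it becomes the new spare, which matches the slides' point that the admin's repair restores redundancy rather than connectivity.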
13 CS294 Project Port Virtualization for Higher Performance: The previous performance analysis was based on “1-to-all” messaging. However, it is likely that network access patterns can be broken into groups with high inter-node communication. Thus monitoring can be performed, and the network can be periodically partitioned into activity groups. Create a graph based on the bandwidth used between nodes, then use something like Kernighan-Lin partitioning to separate it into a number of partitions equal to the number of first-stage switches (a power of 2). The re-mapping stage is only slightly simpler than a full n*m port switch (no buffers, never any contention, etc.). [Figure: logical view. 3 switches reserved as spares; 1 failed, and the network was repartitioned.]
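To make the partitioning step concrete, here is a minimal sketch of splitting a traffic graph into two activity groups. It is a simplified pairwise-swap local search in the spirit of Kernighan-Lin, not the full algorithm (there is no gain bookkeeping and no locked-node passes), and the node names and bandwidth figures are invented for illustration.

```python
from itertools import combinations

def cut_weight(traffic, side):
    """Total bandwidth crossing between the two groups.

    traffic maps (node_a, node_b) pairs to measured bandwidth;
    side maps each node to group 0 or 1.
    """
    return sum(w for (a, b), w in traffic.items() if side[a] != side[b])

def bisect(nodes, traffic):
    """Split nodes into two equal-sized groups, greedily swapping pairs
    across the cut while that lowers the crossing bandwidth."""
    nodes = list(nodes)
    half = len(nodes) // 2
    side = {n: (0 if i < half else 1) for i, n in enumerate(nodes)}
    best = cut_weight(traffic, side)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(nodes, 2):
            if side[a] == side[b]:
                continue
            side[a], side[b] = side[b], side[a]       # try the swap
            cost = cut_weight(traffic, side)
            if cost < best:
                best, improved = cost, True           # keep it
            else:
                side[a], side[b] = side[b], side[a]   # undo it
    return side, best

# Invented example: A<->C and B<->D talk heavily, everything else is light.
traffic = {("A", "C"): 90, ("B", "D"): 80, ("A", "B"): 5, ("C", "D"): 5}
side, crossing = bisect(["A", "B", "C", "D"], traffic)
print(crossing)   # 10
```

Swaps keep the two groups the same size, which matches the constraint that each first-stage switch has a fixed number of ports. For more than two first-stage switches, one option is to apply the bisection recursively, which is one reason a power-of-2 switch count is convenient.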
14 CS294 Project Performance / Availability: MTTR for aggregators was typically over an hour. This is on top of the time to detect the failure. By automating recovery, the downtime can be significantly reduced. This depends on timely detection of a failed switch, which could be handled via packet injection. Once the failing switch is identified, a new mapping can be determined quickly. For the performance-optimizing case, restoring connectivity is the top priority; a previously scheduled performance repartitioning can be done later. [Figure: performance-versus-time timelines for three cases. Labels: hard fail; fail detected; switches have adapted; repartition for performance; admin notices & fixes failure.]
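The effect of automated recovery on downtime can be estimated with the standard steady-state availability formula, MTBF / (MTBF + time to restore). The sketch below uses the 208-day MTBF and the 1-hour MTTR from the slides; the 10-second automated detection-plus-remap time is a purely hypothetical figure chosen for illustration, and the function names are likewise assumptions.

```python
HOURS_PER_YEAR = 365 * 24

def availability(mtbf_hours, restore_hours):
    """Steady-state availability: fraction of time the network is up."""
    return mtbf_hours / (mtbf_hours + restore_hours)

def downtime_minutes_per_year(mtbf_hours, restore_hours):
    """Expected minutes of outage per year for the given restore time."""
    unavailable = 1.0 - availability(mtbf_hours, restore_hours)
    return unavailable * HOURS_PER_YEAR * 60

mtbf_hours = 208 * 24                  # 208-day MTBF quoted in the slides

# Manual repair: MTTR of 1 hour (the slides say "typically over an hour").
manual = downtime_minutes_per_year(mtbf_hours, 1.0)

# Automated remap: 10 seconds is a hypothetical value, not from the slides.
automated = downtime_minutes_per_year(mtbf_hours, 10 / 3600)

print(f"manual:    {manual:.1f} min/year")
print(f"automated: {automated:.2f} min/year")
```

This ignores detection time for the manual case and the window of reduced redundancy after a remap, so it is a rough comparison rather than a full availability model.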
15 CS294 Project Generalization: Use homogeneous switches. There is a mapping layer which maps physical to virtual ports. This can range from a simple 1-to-2 mapping to a complex 1-to-n mapping with performance monitoring and repartitioning. Performance can be gained by using some faster switches where needed. [Figure: extra switches for redundancy or extra performance; monitor and port re-mapping block.]
#  Description                                                Fails  Performance  Cost
0  Array of switches                                          0      Low          N switches
1  Single redundancy                                          1      Low          N switches, trivial mapper
2  R-way redundancy                                           R      Low          N switches + R + general mapper
3  Array of switches with partitioning                        0      Adaptive     N switches + expensive mapper
4  R-way redundancy with partitioning                         R      Adaptive     N switches + R + expensive mapper
5  R-way redundancy with partitioning and total utilization   R      Adaptive     N switches + R + expensive mapper
16 CS294 Project Conclusion: It is possible to make a larger virtual switch out of smaller switches and still get reasonable performance. With a little additional hardware and a monitoring agent, it is possible to make it fault tolerant, with several spare switches that can be automatically swapped in. The simple case costs ~O(Spares * Ports); more complex designs make it O(Ports^2). With a very simple, but large, switch, it is also possible to optimize for performance by balancing network bandwidth among the switches in the pool. This is a much more costly solution. A generalization would provide a pool of switches connected by the port mapper, with some or none reserved as spares. Both of these concepts and their functionality could be integrated into a single ASIC or even implemented using a network processor.
17 CS294 Project Future Work: How do switches fail? This determines the failure detection method. Implementation of a type 1 or 2 switch would be possible given the relatively simple mapper. Type 3, 4, or 5 would require a complex ASIC, which should be replaced with a network processor and software.