Department of Computer Science
A Scalable, Commodity Data Center Network Architecture
Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat
SIGCOMM’08
Reporter: Fuchao Zhou

Problem
How to design a data center network architecture
-- Scalable interconnection bandwidth
-- Without incurring tremendous cost
-- Compatibility with hosts running Ethernet and IP

Existing solutions
Using specialized hardware and communication protocols such as InfiniBand and Myrinet
-- More expensive, since they use high-end switches
-- Not natively compatible with TCP/IP applications
Using commodity Ethernet switches and routers to interconnect cluster machines
-- Need an appropriate network topology
-- Bandwidth scales poorly with cluster size
-- Cost increases non-linearly with cluster size

Existing solutions
Typical architectures today
-- Two-level trees of switches or routers (support 5K to 8K hosts)
-- Three-level trees of switches or routers
Disadvantages
-- Support only 50% of the bandwidth available at the edge of the network
-- Incur tremendous cost ($37M to support 27,648 hosts)

Proposed solution
k-ary fat-tree topology
-- k pods, each containing two layers of k/2 switches
-- (k/2)^2 k-port core switches
-- Supports k^3/4 hosts (a 48-ary fat-tree supports 27,648 hosts)
Advantages
-- Non-blocking
-- All switching elements are identical ($8.64M to support 27,648 hosts)
-- Compatible with hosts running Ethernet and IP

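
The element counts on this slide follow directly from k. A minimal sketch of the arithmetic is given below; the function name `fat_tree_size` and the dictionary keys are illustrative choices, not part of the original talk.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts for a k-ary fat-tree built from identical k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # lower layer: k/2 switches per pod
        "aggregation_switches": k * half,  # upper layer: k/2 switches per pod
        "core_switches": half * half,      # (k/2)^2
        "hosts": k ** 3 // 4,              # k pods * k/2 edge switches * k/2 hosts
    }


size = fat_tree_size(48)
print(size["hosts"], size["core_switches"])  # 27648 576
```

With k = 4 the same function gives 16 hosts.
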
Static Routing method
Two-level routing table
-- Maximum bisection bandwidth in this network
IP addresses
-- Core switches: 10.k.j.i
-- Pod switches: 10.pod.switch.1
-- Hosts: 10.pod.switch.ID

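
The addressing scheme can be enumerated mechanically. The sketch below is illustrative only: the slide gives the address patterns but not the index ranges, so the ranges used here (pod and switch in 0..k-1, edge switches numbered 0..k/2-1, core coordinates j and i in 1..k/2, host ID in 2..k/2+1) are assumptions, and `fat_tree_addresses` is a hypothetical helper name.

```python
def fat_tree_addresses(k: int) -> dict:
    """Enumerate addresses following the 10.pod.switch.ID patterns (ranges assumed)."""
    half = k // 2
    # Pod switches: 10.pod.switch.1 (assumed: pod and switch both in 0..k-1)
    pod_switches = [f"10.{pod}.{sw}.1" for pod in range(k) for sw in range(k)]
    # Core switches: 10.k.j.i (assumed: j and i in 1..k/2)
    core_switches = [
        f"10.{k}.{j}.{i}" for j in range(1, half + 1) for i in range(1, half + 1)
    ]
    # Hosts: 10.pod.switch.ID (assumed: edge switches are 0..k/2-1, ID in 2..k/2+1)
    hosts = [
        f"10.{pod}.{sw}.{host_id}"
        for pod in range(k)
        for sw in range(half)
        for host_id in range(2, half + 2)
    ]
    return {"pod_switches": pod_switches, "core_switches": core_switches, "hosts": hosts}


addrs = fat_tree_addresses(4)
print(addrs["core_switches"])  # ['10.4.1.1', '10.4.1.2', '10.4.2.1', '10.4.2.2']
print(addrs["hosts"][:2])      # ['10.0.0.2', '10.0.0.3']
```
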
Static Routing example
[Figure: a packet routed from a source host to a destination host, with the Prefix / Output port routing tables of the switches along the path]

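
A two-level lookup of the kind these Prefix / Output port tables illustrate can be sketched as follows. This is a minimal illustration, not the authors' implementation: the first level is a longest-prefix match, and a prefix without a port falls through to a second-level match on the host ID (last octet). The table contents and the name `two_level_lookup` are hypothetical and are not reconstructed from the slide.

```python
import ipaddress


def two_level_lookup(dst, prefixes, suffixes):
    """Return the output port for dst, or None if nothing matches.

    prefixes: list of (CIDR string, port or None); None means "consult suffixes".
    suffixes: dict mapping the last octet of the address (host ID) to a port.
    """
    addr = ipaddress.ip_address(dst)
    best = None  # (prefix length, port) of the longest matching prefix so far
    for cidr, port in prefixes:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, port)
    if best is None:
        return None
    if best[1] is not None:
        return best[1]  # terminating prefix: forward directly
    return suffixes.get(int(addr) & 0xFF)  # second level: match on host ID


# Hypothetical table for one pod switch (values are illustrative only).
prefixes = [("10.2.0.0/24", 0), ("10.2.1.0/24", 1), ("0.0.0.0/0", None)]
suffixes = {2: 2, 3: 3}

print(two_level_lookup("10.2.1.2", prefixes, suffixes))  # 1 (intra-pod, by prefix)
print(two_level_lookup("10.3.0.3", prefixes, suffixes))  # 3 (inter-pod, by host ID)
```

Matching inter-pod traffic on the host ID spreads destinations in the same remote subnet across different uplinks, while packets to the same host always take the same port.
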
Dynamic Routing methods
Flow classification
1. Recognize subsequent packets of the same flow and forward them to the same outgoing port to avoid packet reordering.
2. Periodically reassign output ports to keep flows fairly distributed across output ports as flow sizes change.

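
The two steps above can be sketched as a small per-switch classifier. This is a simplified illustration under stated assumptions: a flow is identified by its (source, destination) pair, new flows go to the least-loaded port, and rebalancing moves the largest flow that is smaller than the load gap between the busiest and the idlest port. The class name `FlowClassifier` and its methods are hypothetical.

```python
from collections import defaultdict


class FlowClassifier:
    """Per-switch flow table: pins each flow to one port, rebalances on demand."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.flow_port = {}                 # (src, dst) -> assigned output port
        self.flow_bytes = defaultdict(int)  # (src, dst) -> bytes seen so far

    def _load(self, port):
        # Total bytes of all flows currently assigned to this port.
        return sum(n for f, n in self.flow_bytes.items() if self.flow_port[f] == port)

    def forward(self, src, dst, nbytes):
        flow = (src, dst)
        if flow not in self.flow_port:      # new flow: pick the least-loaded port
            self.flow_port[flow] = min(self.ports, key=self._load)
        self.flow_bytes[flow] += nbytes
        return self.flow_port[flow]         # known flow: same port, no reordering

    def rebalance(self):
        heavy = max(self.ports, key=self._load)
        light = min(self.ports, key=self._load)
        gap = self._load(heavy) - self._load(light)
        movable = [f for f, p in self.flow_port.items()
                   if p == heavy and self.flow_bytes[f] < gap]
        if movable:                         # move the largest flow that fits the gap
            biggest = max(movable, key=lambda f: self.flow_bytes[f])
            self.flow_port[biggest] = light


fc = FlowClassifier(ports=[2, 3])
fc.forward("a", "x", 100)   # first flow goes to port 2
fc.forward("b", "x", 100)   # port 3 is now the least loaded
fc.forward("c", "x", 50)    # tie, so port 2
fc.forward("a", "x", 900)   # flow a grows on port 2
fc.rebalance()              # flow c moves from port 2 to port 3
```

In a real switch `rebalance` would run on a timer; here it is called explicitly to keep the sketch deterministic.
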
Dynamic Routing methods
Flow scheduling (with a central scheduler)
Method 1 (notification)
1. Edge switches detect any outgoing large flow.
2. They periodically send notifications to a central scheduler.
3. The central scheduler orders a reassignment.
Method 2 (monitor)
1. A central scheduler tracks all active large flows.
2. It assigns them non-conflicting paths if possible.
3. The scheduler maintains Boolean state for all links.

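
A minimal sketch of the scheduler side, assuming that candidate paths are supplied as lists of link identifiers and that the first conflict-free path is taken. The class `CentralScheduler`, its methods, and the link names are hypothetical; only the Boolean per-link state and the non-conflicting placement come from the slide.

```python
class CentralScheduler:
    """Tracks one Boolean per link and places large flows on conflict-free paths."""

    def __init__(self):
        self.busy = {}    # link -> True if reserved by a large flow
        self.placed = {}  # flow id -> links reserved for that flow

    def place(self, flow_id, candidate_paths):
        """Reserve and return the first path with no busy link, else None."""
        for path in candidate_paths:
            if not any(self.busy.get(link, False) for link in path):
                for link in path:
                    self.busy[link] = True
                self.placed[flow_id] = list(path)
                return path
        return None  # no conflict-free path: the flow keeps its current route

    def release(self, flow_id):
        """Free the links of a flow that has ended."""
        for link in self.placed.pop(flow_id, []):
            self.busy[link] = False


# Two hypothetical inter-pod paths, one per core switch.
via_core1 = [("edge0", "agg0"), ("agg0", "core1"), ("core1", "agg2"), ("agg2", "edge2")]
via_core2 = [("edge0", "agg1"), ("agg1", "core2"), ("core2", "agg3"), ("agg3", "edge2")]

sched = CentralScheduler()
print(sched.place("f1", [via_core1, via_core2]) is via_core1)  # True
print(sched.place("f2", [via_core1, via_core2]) is via_core2)  # True
print(sched.place("f3", [via_core1, via_core2]))               # None
```
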
Fault-Tolerance
Simple failure broadcast protocol
-- Each switch maintains a Bidirectional Forwarding Detection (BFD) session (D. Katz, D. Ward. BFD for IPv4 and IPv6, 2008)
Two classes of failures

Fault-Tolerance based on the flow classification (1)
-- Outgoing inter- and intra-pod traffic originating from the edge switch
-- Intra-pod traffic using the upper-layer switch as an intermediary
-- Inter-pod traffic coming into the upper-layer switch

Fault-Tolerance based on the flow classification (2)
-- Outgoing inter-pod traffic
-- Incoming inter-pod traffic

Fault-Tolerance based on the flow scheduling
-- Simpler
-- The scheduler marks any link reported to be down as busy or unavailable

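
Marking a failed link as unavailable is enough to keep flows off it, because path selection already skips unavailable links. A minimal sketch of that idea, with a hypothetical `LinkState` class and made-up link names:

```python
class LinkState:
    """Boolean availability per link; a failed link is simply marked unavailable."""

    def __init__(self, links):
        self.available = {link: True for link in links}

    def report_down(self, link):
        self.available[link] = False

    def report_up(self, link):
        self.available[link] = True

    def usable_paths(self, paths):
        # A path is usable only if every one of its links is available.
        return [p for p in paths if all(self.available.get(l, False) for l in p)]


path_a = ["edge0-agg0", "agg0-core1"]
path_b = ["edge0-agg1", "agg1-core2"]
state = LinkState(path_a + path_b)

state.report_down("agg0-core1")
print(state.usable_paths([path_a, path_b]))  # [['edge0-agg1', 'agg1-core2']]
```
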
Limitations
The performance evaluation covers only a prototype of the architecture consisting of 4 pods (16 hosts)
The fat-tree topology has high wiring overhead
-- 3k^3/4 cables for a k-ary fat-tree
-- e.g. k=48, supporting 27,648 hosts: 3*48^3/4 = 82,944 cables
The changes required to commodity switches should be considered
-- They do not support the dynamic routing techniques
-- They do not support two-level routing tables

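
The cable count quoted on this slide can be checked layer by layer, assuming one cable per link and counting host-to-edge cables as well:

```latex
\begin{align*}
\text{host-to-edge links} &= \frac{k^3}{4} \\
\text{edge-to-aggregation links} &= k \cdot \frac{k}{2} \cdot \frac{k}{2} = \frac{k^3}{4} \\
\text{aggregation-to-core links} &= \left(\frac{k}{2}\right)^2 \cdot k = \frac{k^3}{4} \\
\text{total} &= \frac{3k^3}{4}, \qquad k = 48:\ \frac{3 \cdot 48^3}{4} = \frac{3 \cdot 110{,}592}{4} = 82{,}944
\end{align*}
```
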
Limitations
Dynamic routing techniques also have limitations
-- The flow classifier has only local knowledge available
-- A centralized scheduler with global knowledge may be infeasible for large, arbitrary networks
The two-level routing solution cannot avoid local congestion without a dynamic routing technique

Q&A

Extra slides