A Scalable, Commodity Data Center Network Architecture Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat Presented by Gregory Peaker and Tyler Maclean
Overview Structure and Properties of a Data Center Desired properties in a DC Architecture Fat tree based solution Evaluation of fat tree approach
Common data center topology
Problem With common DC topology Single point of failure Over subscription of links higher up in the topology – Typical over subscription is between 2.5:1 and 8:1 – Trade off between cost and provisioning
Properties of solutions Compatible with Ethernet and TCP/IP Cost effective – Low power consumption & heat emission – Cheap infrastructure – Commodity hardware Allows host communication at line speed – Over subscription of 1:1
Cost of maintaining switches
Review of Layer 2 & Layer 3 Layer 2 – Data Link Layer Ethernet MAC address – One spanning tree for entire network Prevents looping Ignores alternate paths Layer 3 – Transport Layer TCP/IP – Shortest path routing between source and destination – Best-effort delivery
FAT Tree based Solution Connect end-host together using a fat tree topology – Infrastructure consist of cheap devices Every port is the same speed – All devices can transmit at line speed if packets are distributed along existing paths – A k-port fat tree can support k 3 /4 hosts
Fat-Tree Topology
Problems with a vanilla Fat-tree Layer 3 will only use one of the existing equal cost paths Packet re-ordering occurs if layer 3 blindly takes advantage of path diversity – Creates overhead at host as TCP must order the packets
FAT-tree Modified Enforce special addressing scheme in DC – Allows host attached to same switch to route only through switch – Allows inter-pod traffic to stay within pod – unused.PodNumber.switchnumber.Endhost Use two level look-ups to distribute traffic and maintain packet ordering.
2 Level look-ups First level is prefix lookup – Used to route down the topology to endhost Second level is a suffix lookup – Used to route up towards core – Diffuses and spreads out traffic – Maintains packet ordering by using the same ports for the same endhost
Diffusion Optimizations Flow classification – Eliminates local congestion – Assign to traffic to ports on a per-flow basis instead of a per-host basis Flow scheduling – Eliminates global congestion – Prevent long lived flows from sharing the same links – Assign long lived flows to different links
Results: Heat & Power Consumption
Implementation NetFPGA: 4 Gigabit Ports, 36 Mb SRAM 64MB DDR2, 3GB SATA Port Implemented elements in Click Router Software Two Level Table Initialized with preconfigured information Flow Classifier Distributes output evenly across local ports Flow Report + Flow Schedule Communicates with central schedule
Evaluation Purpose: measure bisection bandwidth Fat-Tree: 10 machines connected to 48 port switch Hierarchical: 8 machines connected to 48 port switch
Results
Related Work Myrinet – popular for cluster based supercomputers Benefit: low latency Cost: proprietary, host responsible for load balancing Infiniband – used in high-performance computing environments Benefit: proven to scale and high bandwidth Cost: imposes its own layer 1-4 protocol Uses Fat Tree Many massively parallel computers such as Thinking Machines & SGI use fat-trees
Conclusion The Good: cost per gigabit, energy per gigabit is going down The Bad: Datacenters are growing faster than commodity Ethernet devices Our fat-tree solution Is better: technically infeasible 27k node cluster using 10 GigE, we do it in $690M Is faster: equal or faster bandwidth in tests Increases fault tolerance Is Cheaper: 20k hosts costs $37M for hierarchical and $8.67M for fat-tree (1 GigE) KO’s the competing data center’s