VL2: A Scalable and Flexible Data Center Network Greenberg et al., SIGCOMM '09
Some refresher…
Layer 2 (Data link layer): Addressing: MAC address. Learning: flooding. Switched, with a minimum spanning tree. Semantics: unicast, multicast, broadcast.
Layer 3 (Network layer): Addressing: IP address. Learning: routing protocol, dynamic (BGP, OSPF) or static. Routed.
ARP: discovery protocol that ties an IP address back to a MAC address; utilizes layer-2 broadcast semantics.
Fat tree VL2, written by Microsoft Research and presented by Sargun Dhillon https://www.youtube.com/watch?v=k6dEK0oz0gg
Layer 2 Learning VL2, written by Microsoft Research and presented by Sargun Dhillon https://www.youtube.com/watch?v=k6dEK0oz0gg
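As a concrete companion to the layer-2 learning slide (illustrative only, not from the paper), here is a minimal Python sketch of a learning switch: it floods frames whose destination MAC is unknown and learns which port each source MAC lives on.

```python
# Minimal sketch of layer-2 learning (illustrative, not from the paper).
# A switch learns the port for each source MAC and floods unknown destinations.

class LearningSwitch:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}  # MAC address -> port

    def receive(self, frame_src, frame_dst, in_port):
        # Learn: the source MAC is reachable via the ingress port.
        self.mac_table[frame_src] = in_port
        if frame_dst in self.mac_table:
            return [self.mac_table[frame_dst]]  # forward to the known port
        # Unknown destination (or broadcast): flood to all other ports.
        return [p for p in range(self.num_ports) if p != in_port]

switch = LearningSwitch(num_ports=4)
print(switch.receive("aa:aa", "bb:bb", in_port=0))  # floods: [1, 2, 3]
print(switch.receive("bb:bb", "aa:aa", in_port=2))  # learned: [0]
```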
Minimum Spanning Tree VL2, written by Microsoft Research and presented by Sargun Dhillon https://www.youtube.com/watch?v=k6dEK0oz0gg
Layer 3 Routing VL2, written by Microsoft Research and presented by Sargun Dhillon https://www.youtube.com/watch?v=k6dEK0oz0gg
Architecture of Data Center Networks (DCN)
Conventional DCN Problems
[Figure: conventional tree topology of core routers (CR), aggregation routers (AR), switches (S), and servers (A), with oversubscription ratios of 1:240 at the core, 1:80 at aggregation, and 1:5 at the top-of-rack layer; speech bubbles "I want more" / "I have spare ones, but…" illustrate resource fragmentation]
Static network assignment
Fragmentation of resources
Poor server-to-server connectivity
Traffic of one service affects others
Poor reliability and utilization
Hakim Weatherspoon, High Performance Systems and Networking Lecture Slides https://www.cs.cornell.edu/courses/cs5413/2014fa/lectures/09-vl2.pptx
Objectives
Uniform high capacity: the maximum rate of server-to-server traffic flow should be limited only by the capacity of the network cards; assigning servers to a service should be independent of network topology.
Performance isolation: traffic of one service should not be affected by the traffic of other services.
Layer-2 semantics: easily assign any server to any service; configure a server with whatever IP address the service expects; a VM keeps the same IP address even after migration.
Measurements and Implications of DCN (1)
Data-center traffic analysis:
The ratio of traffic volume between servers to traffic entering/leaving the data center is 4:1
Demand for bandwidth between servers is growing faster
The network is the bottleneck of computation
Measurements and Implications of DCN (2)
Flow distribution analysis:
The majority of flows are small; the biggest flows are around 100 MB
The distribution of internal flows is simpler and more uniform
More than 50% of the time a node has about 10 concurrent flows, but at least 5% of the time it has more than 80 concurrent flows
Measurements and Implications of DCN (3)
Traffic matrix analysis:
Traffic patterns resist summarization (no small set of representative matrices)
Traffic patterns are unstable over time
The lack of predictability stems from the use of randomness to improve the performance of data-center applications
Measurements and Implications of DCN (4)
Failure characteristics:
Pattern of networking equipment failures: 95% of failures are resolved within 1 min, 98% within 1 hr, 99.6% within 1 day; 0.09% last more than 10 days
No obvious way to eliminate all failures from the top of the hierarchy
Virtual Layer 2 Switch (VL2)
Design principles:
Randomizing to cope with volatility: use Valiant Load Balancing (VLB) to do destination-independent traffic spreading across multiple intermediate nodes
Building on proven networking technology: use IP routing and forwarding technologies already available in commodity switches
Separating names from locators: use a directory system to maintain the mapping between names and locations
Embracing end systems: a VL2 agent runs at each server
Topology
[Figure: VL2's Clos topology — each ToR connects to two aggregation switches, and the aggregation and intermediate switches form a complete bipartite graph]
Addressing and Routing (1)
Packet forwarding: the VL2 agent traps each packet and encapsulates it with the LA (locator address) for routing
Address resolution: servers appear to be in the same subnet; ARP is captured by the agent and redirected as a unicast request to the directory server; the directory server replies with routing info, which the agent then caches
Access control is enforced via the directory server
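A minimal Python sketch of that resolution path (names and interfaces hypothetical, assuming a simple lookup RPC): the agent intercepts what would have been an ARP broadcast, asks the directory system via unicast, caches the AA-to-LA mapping, and tunnels the packet inside an LA header.

```python
# Illustrative sketch (hypothetical names) of a VL2 agent resolving an
# application address (AA) to a locator address (LA) and encapsulating.

class Directory:
    """Stands in for the directory service RPC."""
    def __init__(self, mapping):
        self.mapping = mapping   # AA -> LA of the destination's ToR

    def lookup(self, aa):
        return self.mapping[aa]

class VL2Agent:
    def __init__(self, directory):
        self.directory = directory
        self.cache = {}          # cached AA -> LA mappings

    def resolve(self, aa):
        if aa not in self.cache:
            # Unicast request to the directory replaces the ARP broadcast.
            self.cache[aa] = self.directory.lookup(aa)
        return self.cache[aa]

    def encapsulate(self, packet, dst_aa):
        la = self.resolve(dst_aa)
        # Tunnel the AA packet inside an LA header routed by the underlay.
        return {"outer_dst": la, "inner": packet}

agent = VL2Agent(Directory({"10.0.0.5": "20.1.1.1"}))
print(agent.encapsulate({"dst": "10.0.0.5", "data": b"hi"}, "10.0.0.5"))
```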
Addressing and Routing (2)
Valiant Load Balancing (VLB) for random routes: when going up, choose a random intermediate node (see the sketch below)
Equal-Cost Multi-Path forwarding (ECMP) for fail-safety: assign the same LA to multiple nodes; when a node fails, IP anycast routes around it to a backup node
TCP for congestion control
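A sketch of VLB-style spreading (illustrative, with made-up addresses): hash a flow's 5-tuple to pick one of the intermediate switches, so each flow takes a random path through the Clos network while all packets of that flow stay on one path, avoiding TCP reordering, much as ECMP does in real switches.

```python
import hashlib

# Hypothetical anycast LAs of the intermediate switches.
INTERMEDIATE_LAS = ["20.0.0.1", "20.0.0.2", "20.0.0.3", "20.0.0.4"]

def pick_intermediate(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """Pick a random-but-stable intermediate switch for a flow."""
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = hashlib.md5(flow).digest()
    # Same flow -> same index (no reordering); flows spread uniformly.
    index = int.from_bytes(digest[:4], "big") % len(INTERMEDIATE_LAS)
    return INTERMEDIATE_LAS[index]

print(pick_intermediate("10.0.0.5", "10.0.1.7", 42513, 80))
```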
Directory System
Backwards compatibility
Interaction with hosts on the internet: external traffic can flow directly, without being forced through gateway servers to have its headers rewritten; a server that needs to be reachable externally is assigned an LA in addition to its AA for direct communication
Handling broadcast: ARP is replaced by the directory system; DHCP is intercepted by the agent and sent as a unicast transmission directly to the DHCP server
Evaluations (1)
Uniform high capacity:
All-to-all data shuffle stress test: 75 servers, each delivering 500 MB to every other server
Maximal achievable goodput is 62.3 Gbps
VL2 network efficiency is 58.8/62.3 = 94% of the maximum achievable goodput
Evaluations (2)
Fairness:
75 nodes
Real data-center workload
Plot Jain's fairness index for traffic across the intermediate switches
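For reference, Jain's fairness index over rates x_1..x_n is J(x) = (sum of x_i)^2 / (n * sum of x_i^2); it equals 1.0 when all rates are equal and 1/n when one flow gets everything. A quick sketch:

```python
# Jain's fairness index: J(x) = (sum x_i)^2 / (n * sum x_i^2).

def jains_index(rates):
    n = len(rates)
    total = sum(rates)
    return total * total / (n * sum(r * r for r in rates))

print(jains_index([10, 10, 10, 10]))  # 1.0  (perfectly fair)
print(jains_index([40, 0, 0, 0]))     # 0.25 (maximally unfair for n=4)
```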
Evaluations (3)
Performance isolation:
Two types of services:
Service one: 18 servers performing a single TCP transfer continuously
Service two: a new server starts every 2 seconds, each performing an 8 GB transfer over TCP on startup, up to a total of 19 servers
Evaluations (4)
Convergence after link failures:
75 servers
All-to-all data shuffle
Disconnect links between intermediate and aggregation switches
Evaluations (5)
Directory system's performance:
40K lookups/s with response time < 1 ms
Only 0.06% of all servers are needed to run the directory system, even in the worst-case scenario
Convergence latency is within 100 ms for 70% of the updates and 530 ms for 99%
Evaluations (6)
Compare VLB with:
Adaptive routing (e.g., TeXCP) as an upper bound
Best oblivious routing (VLB is one form of oblivious routing)
VLB (and the best oblivious routing) is close to the adaptive routing scheme in terms of link utilization
Summary
VL2 consolidates layers 2 and 3 into a single "virtual layer 2"
Key design choices:
Layer-2 semantics -> flat addressing
Uniform high capacity -> guaranteed bandwidth using VLB
Performance isolation -> enforce the hose traffic model using TCP
Questions?
Q: With the bipartite graph between aggregation and intermediate switches, is there a limit to the network size without oversubscription?
A: The VLB design guarantees bandwidth without oversubscription, and the authors claim the network can be scaled as large as the budget allows. Given D_I ports per intermediate switch and D_A ports per aggregation switch, the maximum number of servers is 20 * (D_A * D_I / 4) (see the sketch below).
Q: How does the implementation of VL2 compare to today's (2018) state-of-the-art data-center network design?
A: Today's network designs are focused on software-defined networking.
Q: Since there is one more layer that handles the routing, will it become the new bottleneck of this system if every request is small but the number of requests is very large?
A: Unanswered.
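A quick check of the scaling formula quoted above, as a sketch (the reading that each ToR hosts 20 servers and the Clos supports D_A * D_I / 4 ToRs follows the paper's topology):

```python
# Maximum servers supported by a VL2 Clos built from D_A-port aggregation
# switches and D_I-port intermediate switches: 20 * D_A * D_I / 4
# (each ToR hosts 20 servers; the network supports D_A * D_I / 4 ToRs).

def max_servers(d_a, d_i):
    return 20 * d_a * d_i // 4

print(max_servers(144, 144))  # 103,680 servers with 144-port switches
```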