
1 CS434/534: Topics in Network Systems Cloud Data Centers: VL2 Control; VLB/ECMP Load Balancing Routing Yang (Richard) Yang Computer Science Department Yale University 208A Watson Acknowledgement: slides include content from classes by M. Alizadeh and the Presto authors.

2 Outline
- Admin and recap
- Cloud data center (CDC) networks
  - Background, high-level goal
  - Traditional CDC vs. the one-big-switch abstraction
  - VL2 design and implementation
    - Overview
    - Topology
    - Control: layer-2 semantics; ECMP/VLB load balancing / performance isolation
  - Extension: Presto

3 Admin
- PS1 status
- Please set up meetings on potential projects

4 Recap: Data Centers
- The largest cost component of a data center (DC) is servers, but server utilization is often low
- Goal of a DC infrastructure: agility
  - Turn the servers into a single large fungible pool
  - Dynamically expand and contract service footprint as needed

5 Recap: Problems of Conventional DC
[Figure: conventional DC topology. Internet at the top, Core Routers (CR, L3), Access Routers (AR, L3), Ethernet switches (S, L2), and racks of application servers (A); ~1,000 servers per pod, one pod == one IP subnet]
Problems:
- Heterogeneous server-to-server capacity
- Poor reliability
- Partitioning by IP subnet limits agility

6 Recap: Objectives of VL2
- Layer-2 semantics: easily assign any server to any service
  - Assigning servers to a service should be independent of network topology
  - Configure a server with whatever IP address the service expects
  - A VM keeps the same IP address even after migration
- Uniform high capacity: the maximum rate of a server-to-server traffic flow should be limited only by the capacity of the network cards
- Performance isolation: traffic of one service should not be affected by traffic of other services (needs the above capacity bound)

7 Recap: Generic K-ary Fat Tree Topo
- Motivated by non-blocking Clos networks
- K-ary fat tree: three-layer topology (edge, aggregation, and core)
  - k-port switches, k pods
  - Each pod: k/2 edge switches, each connecting k/2 servers => (k/2)*(k/2) servers per pod, (k/2)*(k/2)*k = k^3/4 servers total
  - Each pod: k/2 aggregation switches
  - Core: (k/2)*(k/2) = k^2/4 core switches total
  - Same number of links between each pair of adjacent layers (Core-Aggr, Aggr-Edge, Edge-Server)
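To make the counting concrete, here is a minimal sketch (ours, not from the slides; it assumes the standard k-ary fat-tree construction above with k even) that computes the switch and server counts for a given k:

```python
# Illustrative helper: counts for a k-ary fat tree, k even.
def fat_tree_counts(k):
    edge_per_pod = k // 2
    aggr_per_pod = k // 2
    servers_per_pod = (k // 2) * (k // 2)   # each edge switch hosts k/2 servers
    core = (k // 2) * (k // 2)              # k^2/4 core switches in total
    servers = servers_per_pod * k           # k^3/4 servers in total
    return {"pods": k, "edge/pod": edge_per_pod, "aggr/pod": aggr_per_pod,
            "core": core, "servers": servers}

print(fat_tree_counts(4))    # k=4  -> 4 core switches, 16 servers
print(fat_tree_counts(48))   # k=48 -> 576 core switches, 27,648 servers
```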

8 Recap: VL2 Topology
Assume each Intermediate (Int) switch has D_I ports and each Aggregation (Aggr) switch has D_A ports. Q: Why not the same number of ports?
- Each Aggr switch uses half of its ports to connect to ToR switches and half to Int switches => D_A/2 Int switches
- Each Int switch connects to all Aggr switches => D_I Aggr switches
- Each ToR connects to 2 Aggr switches => D_I * (D_A/2) / 2 = D_I*D_A/4 ToR switches
- Each ToR connects 20 servers => 20 * (D_I*D_A/4) servers in total
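A similar back-of-the-envelope check for the VL2 Clos sizing above (D_I, D_A, and 20 servers per ToR come from the slide; the helper itself is ours):

```python
# Illustrative sketch: VL2 Clos sizing from the port counts on the slide.
def vl2_counts(d_i, d_a, servers_per_tor=20):
    n_int = d_a // 2           # each Aggr uses half its ports toward Int switches
    n_aggr = d_i               # each Int connects to every Aggr switch
    n_tor = d_i * d_a // 4     # each ToR dual-homes to 2 Aggr switches
    n_servers = servers_per_tor * n_tor
    return n_int, n_aggr, n_tor, n_servers

# e.g., with 144-port switches: 72 Int, 144 Aggr, 5,184 ToR, 103,680 servers
print(vl2_counts(144, 144))
```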

9 Outline
- Admin and recap
- Cloud data center (CDC) networks
  - Background, high-level goal
  - Traditional CDC vs. the one-big-switch abstraction
  - VL2 design and implementation
    - Overview
    - Topology
    - Control

10 FatTree Topology is great, But…
Is using a fat-tree topology to interconnect racks of servers sufficient in itself, i.e., can we use any control plane?
- How about traditional layer-2 switching (ARP + learning)?
  - Host churn causes ARP flooding; spanning tree removes most of the capacity!
- How about traditional layer-3 IP routing?
  - Shortest-path routing to each server: constructing a path for each server as a destination needs large flow tables. Assume 10 million virtual endpoints on 500,000 servers in a datacenter => 10M entries, but a typical switch has only 640KB of switch memory, enough for 32-64K flow entries.
  - Aggregation to reduce flow-table size: a VM cannot move easily, because its address becomes a locator.
  - Layer 3 will use only one of the existing equal-cost paths => bottlenecks up and down the fat tree.
  - Simple extensions to IP forwarding: packet reordering occurs if layer 3 blindly takes advantage of path diversity; furthermore, load may not be well balanced.
- Wiring complexity in large networks
  - Packing and placement techniques

11 VL2 Solution to Addressing and Routing: Name-Location Separation
Whole network as one L2 domain without a scaling bottleneck.
[Figure: the VL2 directory service maps application addresses to ToR locators, e.g., x -> ToR2, y -> ToR3, z -> ToR3 (updated to ToR4 after z migrates); a sender looks up the mapping and the packet is delivered via the destination ToR]
- Servers use flat names
- Routing uses locators (ToR addresses)

12 Discussion Requirements on the Directory System?
What is a possible design?
[Figure: directory service with mappings x -> ToR2, y -> ToR3, z -> ToR3, later updated to z -> ToR4]

13 VL2 Directory System
[Figure: VL2 agents query read-optimized Directory Servers (DS); Directory Servers replicate writes through write-optimized RSM servers. Lookup: 1. lookup, 2. reply. Update: 1. update, 2. set, 3. replicate, 4. ack, 5. ack, (6. disseminate)]
- Read-optimized Directory Servers for fast lookups: low latency, high throughput, high availability for a high lookup rate
- Write-optimized Replicated State Machines (RSM) using Paxos for reliable updates: a strongly consistent, reliable store of AA-to-LA mappings
Q: Stale mappings?
- Reactive cache updates: a stale host mapping needs to be corrected only when that mapping is used to deliver traffic. Non-deliverable packets are forwarded to a directory server, which corrects the stale mapping in the source's cache via unicast.
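A minimal sketch of the lookup path and the reactive cache correction described above (class and method names are ours; a real deployment replicates the directory servers and runs Paxos in the RSM, which this toy model does not attempt):

```python
# Toy model of the two-tier directory: a strongly consistent store standing in for
# the Paxos-based RSM, plus a read-optimized directory server with soft-state cache.
class RSMStore:
    def __init__(self):
        self.aa_to_la = {}              # authoritative AA -> LA (ToR) mappings
    def set(self, aa, la):
        self.aa_to_la[aa] = la          # in reality: replicate via Paxos, then ack

class DirectoryServer:
    def __init__(self, rsm):
        self.rsm, self.cache = rsm, {}  # read-optimized soft state
    def lookup(self, aa):
        if aa not in self.cache:
            self.cache[aa] = self.rsm.aa_to_la[aa]
        return self.cache[aa]
    def correct(self, agent, aa):
        # reactive update: a non-deliverable packet was forwarded here, so push
        # the fresh mapping to the stale source agent via unicast
        self.cache[aa] = self.rsm.aa_to_la[aa]
        agent.cache[aa] = self.cache[aa]

class VL2Agent:
    def __init__(self, ds):
        self.ds, self.cache = ds, {}
    def resolve(self, aa):
        if aa not in self.cache:
            self.cache[aa] = self.ds.lookup(aa)
        return self.cache[aa]

rsm = RSMStore(); rsm.set("z", "ToR3")
ds = DirectoryServer(rsm); agent = VL2Agent(ds)
print(agent.resolve("z"))   # ToR3
rsm.set("z", "ToR4")        # z migrates; the agent's cached mapping is now stale
ds.correct(agent, "z")      # triggered when a packet sent to ToR3 bounces
print(agent.resolve("z"))   # ToR4
```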

14 Routing Design Option
[Figure: Int switches I0-I2 and ToRs T1-T6; a packet from x to z is encapsulated toward the destination ToR]
Remaining issue: what are the path(s) for each srcToR/dstToR pair?

15 Example
Assume 8-port switches; each link is 10G.
[Figure: Int and Aggr layers]
Q: What is an example routing that can lead to contention / no isolation?

16 Example
Assume 8-port switches; each link is 10G.
[Figure: Int and Aggr layers]
Objective: spread traffic so that no such contention can happen, as long as each host is bounded by its interface card rate.

17 Outline
- Admin and recap
- Cloud data center (CDC) networks
  - Background, high-level goal
  - Traditional CDC vs. the one-big-switch abstraction
  - VL2 design and implementation
    - Overview
    - Topology
    - Control: layer-2 semantics; VLB/ECMP load balancing / performance isolation

18 Offline: Traditional Valiant Load Balancing for Hose Model
We will start with the simple (but unrealistic) homogeneous case where all the backbone nodes have the same capacity r. In this case, a VLB network consists of a full mesh of logical links with capacity 2r/N each. Traffic entering the backbone is load-balanced equally across all N one- and two-hop paths between ingress and egress. A packet is forwarded twice in the network: in the first hop, a node uniformly load-balances each of its incoming flows across all N nodes, regardless of the packet destination. Load balancing can be done packet-by-packet or flow-by-flow, and each node receives 1/N of every flow in the first hop. In the second hop, all packets are delivered to their final destinations. VLB has the nice characteristic that it can support all traffic matrices that do not oversubscribe a node. Since the incoming traffic rate to each node is at most r, and the traffic is evenly load-balanced across N nodes, the actual traffic on each link due to first-hop routing is at most r/N. The second hop is the dual of the first: since each node can receive traffic at a maximum rate of r and receives 1/N of the traffic from every node, the traffic on each link due to the second hop is also at most r/N. Therefore, the full-mesh network (with link capacities 2r/N) can support all traffic matrices.
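The bound above can also be checked numerically: for any hose-compliant traffic matrix (every row and column sum at most r), the first-hop and second-hop load on every directed link stays within r/N, so 2r/N per link suffices. A small illustrative sketch (ours):

```python
import random

def vlb_link_loads(traffic, n):
    """traffic[s][d] = demand from s to d; VLB splits each flow 1/n across all
    intermediate nodes. Returns the max first-hop and second-hop link loads."""
    first = [[0.0] * n for _ in range(n)]    # load on link s -> i (hop 1)
    second = [[0.0] * n for _ in range(n)]   # load on link i -> d (hop 2)
    for s in range(n):
        for d in range(n):
            for i in range(n):
                first[s][i] += traffic[s][d] / n
                second[i][d] += traffic[s][d] / n
    return max(map(max, first)), max(map(max, second))

n, r = 8, 10.0
# Build a random hose-compliant matrix: no node sends or receives more than r.
m = [[random.random() for _ in range(n)] for _ in range(n)]
for s in range(n):                            # scale rows to sum exactly r
    row = sum(m[s]); m[s] = [r * v / row for v in m[s]]
for d in range(n):                            # scale down any column above r
    col = sum(m[s][d] for s in range(n))
    if col > r:
        for s in range(n): m[s][d] *= r / col
h1, h2 = vlb_link_loads(m, n)
print(h1 <= r / n + 1e-9, h2 <= r / n + 1e-9)  # True True
```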

19 Valiant Load Balancing Intuition in VL2 Setting (Aggr-Int)
[Figure: an Aggr switch a connected to the Int switches]
Alg: spread traffic uniformly across the Int switches.
Q: Effect on the example?
Q: Bound (assume D_I = D_A = 8): a -> i traffic is 1/4 of a's total upstream traffic (<= 10G); i -> a traffic is 1/4 of the total traffic going down to a (<= 10G).

20 VLB Realization 1
[Figure: Int switches I0-I2 and ToRs T1-T6; a packet from x to z is double-encapsulated]
- The end host picks a random Int switch (e.g., I0) and encapsulates; the network uses ECMP routing; Int switches and ToR switches do the decapsulation.
Q: What are all the upstream and downstream paths? How may ECMP use multiple paths?

21 VLB Realization 1: Problem
Problem: every host needs to be updated whenever an Int switch changes state.

22 Final VLB Realization
[Figure: all Int switches share one anycast address I_ANY; ToRs T1-T6; a packet from x to z is encapsulated to I_ANY and then to the destination ToR]
VL2: all Int switches are assigned the same anycast address.
Q: What are all the upstream and downstream paths?

23 Offline Thinking
[Figure: Int and Aggr layers]
What is a bad setting if there is no second encapsulation?

24 Implementation Question
[Figure: anycast Int switches I_ANY and ToRs T1-T6; a double-encapsulated packet from x to z]
What are the tables and actions at each network node? In particular, how might ECMP be implemented?

25 VL2 Agent in Action
[Figure: the VL2 agent double-encapsulates each packet. Outer header: src IP = H(flow 5-tuple), dst IP = Int anycast LA (VLB step); middle header: dst IP = dstToR LA (delivered by ECMP); inner packet: src AA, dst AA, payload]
Why double encap? An Intermediate-switch availability change would otherwise require updating a large number of VL2 agents. Solution: assign the same (anycast) IP address to all Int switches, and let ECMP send to any one of them.
Why hash? ECMP fan-out is limited (e.g., 16-way); carrying a hash of the flow in the packet adds entropy so traffic spreads over more paths.
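A sketch (ours; field layout simplified, names hypothetical) of the double encapsulation the agent performs. The flow hash is carried in the outer header purely to add entropy for ECMP, as discussed above:

```python
import hashlib

INT_ANYCAST_LA = "10.0.0.1"   # shared anycast LA of all Int switches (illustrative value)

def flow_hash(five_tuple):
    # stand-in for the agent's hash of the flow's 5-tuple
    return int(hashlib.md5(repr(five_tuple).encode()).hexdigest(), 16) & 0xFFFF

def vl2_encap(src_aa, dst_aa, dst_tor_la, five_tuple, payload):
    """Double encapsulation: outer header toward an (anycast) Int switch,
    middle header toward the destination ToR, inner AA packet untouched."""
    inner = {"src": src_aa, "dst": dst_aa, "payload": payload}
    middle = {"dst": dst_tor_la, "inner": inner}        # decapsulated by the dst ToR
    outer = {"entropy": flow_hash(five_tuple),          # extra entropy for ECMP
             "dst": INT_ANYCAST_LA, "inner": middle}    # decapsulated by the Int switch
    return outer

pkt = vl2_encap("x", "z", "ToR4_LA", ("x", "z", 5000, 80, "tcp"), b"data")
print(pkt["dst"], pkt["inner"]["dst"])   # 10.0.0.1 ToR4_LA
```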

26 Question to Think Offline
“In 3.2, the paper states that randomizing large flows won't cause much perpetual congestion if misplaced since large flows are only 100 MB and thus take 1 second to transmit on a 1 Gbps link. Isn't 1 second sufficiently high to harm the isolation that VL2 tries to provide?”

27 Summary: VL2 Objectives and Solutions
1. Layer-2 semantics: flat addresses; name-location separation & resolution service
2. Uniform high capacity between servers: multi-root tree topology
3. Performance isolation: flow-based random traffic indirection (Valiant LB)

28 Evaluation Uniform high capacity: All-to-all data shuffle stress test:
75 servers, each delivering 500MB to every other server. The maximal achievable goodput is 62.3 Gbps; the measured goodput gives a VL2 network efficiency of 58.8/62.3 = 94%.

29 Evaluation Performance isolation: Two types of services:
Service one: 18 servers do a single TCP transfer all the time.
Service two (experiment 1): 19 servers, each starting an 8GB transfer over TCP every 2 seconds.
Service two (experiment 2): 19 servers bursting short TCP connections.

30 Critique
- Extra servers are needed to support the VL2 directory system, which adds device cost
- All links and switches are active all the time: not power efficient
- The effectiveness of isolation (load balancing) through VLB/ECMP randomization depends on the traffic model

31 Randomization and Load Balancing: Intuition
Load balancing vs. item sizes: hash 11×1Gbps flows (55% load) onto either 20×1Gbps uplinks or 2×10Gbps uplinks.
- 20×1Gbps uplinks: 100% throughput requires that no uplink (bucket) gets more than 1 flow; Prob of 100% throughput = 3.27%
- 2×10Gbps uplinks: 100% throughput requires that no uplink (bucket) gets more than 10 flows; Prob of 100% throughput = 99.95%
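The 20×1G number is a birthday-style count and can be reproduced directly; the 2×10G case only fails when essentially all 11 flows pile onto one uplink. A quick check (ours; the second value comes out near the slide's figure):

```python
from math import perm

flows = 11
# 20 x 1 Gbps uplinks: each 1 Gbps flow fills an uplink, so full throughput
# requires all 11 flows to hash to distinct uplinks.
p_20x1g = perm(20, flows) / 20**flows
# 2 x 10 Gbps uplinks: full throughput unless one uplink receives all 11 unit flows.
p_2x10g = 1 - 2 * 0.5**flows
print(f"20x1G: {p_20x1g:.2%}   2x10G: {p_2x10g:.2%}")   # ~3.27% vs ~99.9%
```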

32 Randomization and Load Balancing: In Implementation
VL2 realizes randomization through an ECMP hash of the 5-tuple.
- Collisions: multiple large flows can hash onto the same path.
- The decision is local and stateless (bad under asymmetry, e.g., due to link failures); the figure shows a switch picking the next hop as H(f) % 3.
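A toy illustration (ours) of the local, stateless decision: the switch hashes the 5-tuple and takes it modulo its fan-out, so load spreads roughly evenly over the configured next hops and cannot shift away from a path whose downstream capacity has been reduced by a failure:

```python
import zlib

def ecmp_next_hop(five_tuple, next_hops):
    # stateless, local decision: hash the 5-tuple, index into the next-hop set
    h = zlib.crc32(repr(five_tuple).encode())
    return next_hops[h % len(next_hops)]

next_hops = ["path0", "path1", "path2"]
flows = [("10.0.0.%d" % i, "10.0.1.%d" % (i % 7), 40000 + i, 80, "tcp")
         for i in range(1000)]
load = {p: 0 for p in next_hops}
for f in flows:
    load[ecmp_next_hop(f, next_hops)] += 1
print(load)   # roughly 1/3 of the flows per path, regardless of downstream failures
```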

33 Discussion
- When may randomized load balancing perform badly?
- How can we reduce or avoid bad load balancing?

34 Outline
- Admin and recap
- Cloud data center (CDC) networks
  - Background, high-level goal
  - Traditional CDC vs. the one-big-switch abstraction
  - VL2 design and implementation
    - Overview
    - Topology
    - Control: layer-2 semantics; ECMP/VLB load balancing / performance isolation
  - Extension: Presto

35 Presto in Context
- ECMP (per-flow load balancing): elephant collisions
- Per-packet load balancing: high computational overhead; heavy reordering, including for mice flows
- Flowlets (bursts of packets separated by an inactivity timer): effectiveness depends on the workload
  - Small inactivity timer: a lot of reordering; mice flows get fragmented
  - Large inactivity timer: large flowlets (hash collisions)

36 Presto LB Granularity: Flowcells
What is a flowcell? A set of TCP segments with a bounded byte count.
How to choose the flowcell size?
- Implementation feasibility: the bound is the maximal TCP Segmentation Offload (TSO) size, to maximize the benefit of TSO at high speed
- 64KB in the implementation
Presto load balances on flowcells instead of flows; TSO itself is introduced on the next slide.

37 Intro to TSO
When a system needs to send large chunks of data, the TCP/IP stack operates on large segments and passes them down to the NIC; the NIC's TSO performs segmentation into MTU-sized Ethernet frames and checksum offload, reducing the processing overhead of the end-host stack.
TSO is important for software performance: without TSO, a host incurs 100% utilization of one CPU core and can only achieve around 5.5 Gbps.
Flowcell example: for a TCP flow whose first three segments are 25KB, 30KB, and 30KB, the first two segments (55KB) belong to one flowcell.
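A sketch (ours) of how a sender-side vSwitch might group TCP segments into flowcells under the 64KB bound; with segments of 25KB, 30KB, and 30KB, the first two form one 55KB flowcell, matching the example above:

```python
FLOWCELL_BOUND = 64 * 1024   # maximal TSO size, per the slides

def split_into_flowcells(segment_sizes, bound=FLOWCELL_BOUND):
    cells, cur, cur_bytes = [], [], 0
    for size in segment_sizes:
        if cur and cur_bytes + size > bound:   # adding this segment would exceed the bound
            cells.append(cur)
            cur, cur_bytes = [], 0
        cur.append(size)
        cur_bytes += size
    if cur:
        cells.append(cur)
    return cells

print(split_into_flowcells([25 * 1024, 30 * 1024, 30 * 1024]))
# [[25600, 30720], [30720]]  -> flowcell #1 holds the first two segments (55KB)
```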

38 Presto at a High Level
[Figure: spine-leaf topology; hosts run TCP/IP over a vSwitch over the NIC; multiple paths are set up between hosts]
- The sender vSwitch breaks data into flowcells
- This fine-grained load balancing can cause packet reordering, so the receiver masks packet reordering due to multipathing below the transport layer

39 Presto Sender
[Figure: spine-leaf network between Host A and Host B, each with a TCP/IP stack, vSwitch, and NIC]
- The controller installs label-switched paths

40 Presto Sender
- Host A's vSwitch receives TCP segment #1 (50KB)
- Flowcell #1: the vSwitch encodes the flowcell ID and rewrites the label (id, label)
- The NIC uses TSO and chunks segment #1 into MTU-sized packets

41 Presto Sender
- Host A's vSwitch receives TCP segment #2 (60KB)
- Flowcell #2: the vSwitch encodes the flowcell ID and rewrites the label (id, label)
- The NIC uses TSO and chunks segment #2 into MTU-sized packets

42 Benefits
- Most flows are smaller than 64KB [Benson, IMC'11]: the majority of mice are not exposed to reordering
- Most bytes come from elephants [Alizadeh, SIGCOMM'10]: traffic is routed in uniform-size units
- Fine-grained and deterministic scheduling over disjoint paths: near-optimal load balancing

43 Discussion
Is it possible for Presto to still send too much traffic on a link?

44 Backup Slides

45 Presto Receiver Major challenges
- Packet reordering for large flows due to multipath
- Distinguishing loss from reordering
- Fast (10G and beyond)
- Lightweight

46 Intro to GRO
Generic Receive Offload (GRO) is the reverse process of TSO.

47 Intro to GRO
GRO logic sits below the TCP/IP stack in the OS and above the NIC (hardware).

48 Intro to GRO
[NIC queue (head first): P1 P2 P3 P4 ...]
Now let's see how the GRO logic works. Say a bunch of MTU-sized packets arrive in the NIC's queue.

49 Intro to GRO
[GRO begins merging from the queue head: P1, P2, ...]
GRO tries to merge those MTU-sized packets into large TCP segments.

50 Intro to GRO
[GRO merge in progress: P1, P2]

51 Intro to GRO
[GRO holds the merged segment P1 – P2]

52 Intro to GRO
[GRO holds the merged segment P1 – P3]

53 Intro to GRO
[GRO holds the merged segment P1 – P4]

54 Intro to GRO
[GRO push-up of the merged segment P1 – P5]
Merged large TCP segments are pushed up to TCP/IP at the end of a batched IO event (i.e., a polling event).

55 Intro to GRO
[TCP/IP receives the merged segment P1 – P5]
The basic idea behind GRO: spending a few cycles to aggregate packets within GRO creates fewer segments for TCP and avoids using substantially more cycles at TCP/IP and above [Menon, ATC'08]. If GRO is disabled, we get only ~6 Gbps with 100% CPU usage of one core.
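To make the merge and push-up behavior concrete, here is a toy model (ours) of the vanilla GRO decision used in the walkthrough that follows: in-sequence packets are merged into the current segment, and the segment is pushed up on a sequence gap or at the end of the polling batch (MSS limits and timeouts omitted):

```python
def vanilla_gro(packets):
    """packets: list of (seq, length) tuples in arrival order, all from one flow."""
    pushed, cur = [], None           # cur = [start_seq, next_expected_seq]
    for seq, length in packets:
        if cur is None:
            cur = [seq, seq + length]
        elif seq == cur[1]:          # in sequence: merge into the current segment
            cur[1] += length
        else:                        # gap in sequence numbers: push up immediately
            pushed.append(tuple(cur))
            cur = [seq, seq + length]
    if cur:
        pushed.append(tuple(cur))    # end of the batched IO (polling) event
    return pushed

in_order = [(i * 1500, 1500) for i in range(5)]   # P1..P5, back to back
print(vanilla_gro(in_order))                      # [(0, 7500)]: one big segment
```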

56 Reordering Challenges
[NIC queue: P1 P2 P3 P6 P4 P7 P5 P8 P9, i.e., out-of-order packets]
Now we show how the GRO logic breaks in the face of massive packet reordering: packets arrive at the NIC out of order.

57 Reordering Challenges
[GRO: P1 | NIC queue: P2 P3 P6 P4 P7 P5 P8 P9]
GRO still tries to merge packets into segments; it first processes P1.

58 Reordering Challenges
[GRO: P1 – P2 | NIC queue: P3 P6 P4 P7 P5 P8 P9]
P2 is merged with P1.

59 Reordering Challenges
[GRO: P1 – P3 | NIC queue: P6 P4 P7 P5 P8 P9]
P3 is merged into the existing segment too.

60 Reordering Challenges
[GRO: P1 – P3, P6 | NIC queue: P4 P7 P5 P8 P9]
When GRO processes P6, we hit an issue. GRO is designed to be fast and simple: it pushes up the existing segment immediately when 1) there is a gap in the TCP sequence number, 2) the maximum segment size is reached, or 3) a timeout fires. Here, there is a sequence-number gap between P1 – P3 and P6.

61 Reordering Challenges
[TCP/IP: P1 – P3 | GRO: P6 | NIC queue: P4 P7 P5 P8 P9]
So P1 – P3 is pushed up immediately and P6 becomes the existing segment.

62 Reordering Challenges
[TCP/IP: P1 – P3, P6 | GRO: P4 | NIC queue: P7 P5 P8 P9]
When GRO processes P4, it meets the same problem: a sequence-number gap. So P6 is pushed up and P4 becomes the existing segment.

63 Reordering Challenges
[TCP/IP: P1 – P3, P6, P4 | GRO: P7 | NIC queue: P5 P8 P9]
This process continues...

64 Reordering Challenges
[TCP/IP: P1 – P3, P6, P4, P7 | GRO: P5 | NIC queue: P8 P9]
This process continues...

65 Reordering Challenges
[TCP/IP: P1 – P3, P6, P4, P7, P5 | GRO: P8 | NIC queue: P9]
This process continues...

66 Reordering Challenges
[TCP/IP: P1 – P3, P6, P4, P7, P5 | GRO: P8 – P9]
P9 can be merged with P8.

67 Reordering Challenges
[TCP/IP: P1 – P3, P6, P4, P7, P5, P8 – P9]
Finally, all the packets have been pushed up to TCP/IP.

68 Reordering Challenges
In the face of massive packet reordering, GRO is effectively disabled: lots of small packets are pushed up to TCP/IP. This has two unfortunate consequences: 1) huge CPU processing overhead on the end host, since the networking stack has to process all those small segments; 2) poor TCP performance due to the massive reordering.

69 Improved GRO to Mask Reordering for TCP
[NIC queue: P1 P2 P3 P6 P4 P7 P5 P8 P9; P1 – P5 belong to flowcell #1, P6 – P9 to flowcell #2]

70 Improved GRO to Mask Reordering for TCP
[GRO: P1 (flowcell #1) | NIC queue: P2 P3 P6 P4 P7 P5 P8 P9]

71 Improved GRO to Mask Reordering for TCP
[GRO: P1 – P2 (flowcell #1) | NIC queue: P3 P6 P4 P7 P5 P8 P9]

72 Improved GRO to Mask Reordering for TCP
[GRO: P1 – P3 (flowcell #1) | NIC queue: P6 P4 P7 P5 P8 P9]

73 Improved GRO to Mask Reordering for TCP
[GRO: P1 – P3 (flowcell #1), P6 (flowcell #2) | NIC queue: P4 P7 P5 P8 P9]
Idea: merge packets of the same flowcell into one TCP segment, then check whether the segments are in order.

74 Improved GRO to Mask Reordering for TCP
[GRO: P1 – P4, P6 | NIC queue: P7 P5 P8 P9]

75 Improved GRO to Mask Reordering for TCP
[GRO: P1 – P4, P6 – P7 | NIC queue: P5 P8 P9]

76 Improved GRO to Mask Reordering for TCP
[GRO: P1 – P5, P6 – P7 | NIC queue: P8 P9]

77 Improved GRO to Mask Reordering for TCP
[GRO: P1 – P5, P6 – P8 | NIC queue: P9]

78 Improved GRO to Mask Reordering for TCP
[GRO: P1 – P5, P6 – P9 | NIC queue: empty]

79 Improved GRO to Mask Reordering for TCP
[TCP/IP: P1 – P5, P6 – P9; both in-order segments have been pushed up]

80 Improved GRO to Mask Reordering for TCP
Benefits:
1) Large TCP segments are pushed up: CPU efficient
2) Packet reordering is masked from TCP below the transport layer
Issue: how can we tell loss from reordering? Both create gaps in sequence numbers, but a loss should be pushed up immediately, while reordered packets should be held and put back in order.

81 Loss vs Reordering
- Presto sender: packets in one flowcell are sent on the same path (a 64KB flowcell takes ~51 us on a 10G network)
- Heuristic: a sequence-number gap within a flowcell is assumed to be a loss
- Action: no need to wait; push up immediately (see the sketch below)
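A sketch (ours) of the flowcell-aware merge plus this loss-vs-reordering heuristic: segments are kept per flowcell, a gap inside a flowcell is treated as a loss and pushed up without waiting, and a gap at a flowcell boundary is held (in the real system, for an adaptive timeout that estimates the extent of reordering):

```python
def presto_gro(packets):
    """packets: (flowcell_id, seq, length) tuples in arrival order for one flow.
    Returns (segments pushed up immediately, segments still held for reordering)."""
    held = {}        # flowcell_id -> [start_seq, next_expected_seq]
    pushed = []
    for fc, seq, length in packets:
        if fc not in held:
            held[fc] = [seq, seq + length]
        elif seq == held[fc][1]:
            held[fc][1] += length            # in order within the flowcell: merge
        else:
            # gap *within* a flowcell: same path, so assume loss; push up, no wait
            pushed.append((fc, tuple(held[fc])))
            held[fc] = [seq, seq + length]
    # Gaps *between* flowcells (different paths) may be mere reordering, so those
    # segments stay held until an adaptive timeout (not modeled here).
    return pushed, {fc: tuple(v) for fc, v in held.items()}

# Flowcell #1 = P1..P5, flowcell #2 = P6..P9; P2 (seq 1500) is lost in flowcell #1.
pkts = [(1, 0, 1500)] + [(1, s, 1500) for s in (3000, 4500, 6000)] + \
       [(2, s, 1500) for s in (7500, 9000, 10500, 12000)]
print(presto_gro(pkts))
# P1 is pushed up immediately (gap at P2 inside flowcell #1); P3-P5 and P6-P9 stay held.
```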

82 Loss vs Reordering
[Packets of flowcell #1 and flowcell #2 arrive; one packet of flowcell #1 (P2) is lost (✗)]

83 Loss vs Reordering
[GRO: P1, P3 – P5; the gap at P2 (✗) is within flowcell #1]

84 Loss vs Reordering
[TCP/IP: P1, P3 – P5, pushed up with no wait, since the gap at P2 (✗) is within flowcell #1 and assumed to be a loss; GRO: P6 – P9 of flowcell #2]

85 Loss vs Reordering Benefits:
- Most losses happen within a flowcell and are captured by this heuristic, so TCP can react quickly to losses
- Corner case: losses at the flowcell boundaries

86 Loss vs Reordering
[Corner case: packets of flowcell #1 and flowcell #2 arrive, but the packet at the flowcell boundary (P6) is missing (✗)]

87 Loss vs Reordering
[GRO: P1 – P5 merged (all of flowcell #1); P6 has not arrived (✗)]

88 Loss vs Reordering
[TCP/IP: P1 – P5; GRO: P7 – P9; P6 is missing at the flowcell boundary]
The gap falls at a flowcell boundary, so GRO cannot tell whether P6 is lost or merely sent on another path; it waits based on an adaptive timeout (an estimation of the extent of reordering).

89 Loss vs Reordering
[Figure: boundary-loss scenario, final animation step with flowcell #1 (P1 – P5) delivered]

