1
Practical TDMA for Datacenter Ethernet
Bhanu C. Vattikonda, George Porter, Amin Vahdat, Alex C. Snoeren Hello everyone, my name is Bhanu Vattikonda and I am here to present our work on TDMA for Datacenter Ethernet. This is joint work with George Porter, Amin Vahdat and my adviser Alex Snoeren
2
Gather/Scatter All-to-all
Variety of applications hosted in datacenters Datacenters host a wide variety of applications which generate different kinds of traffic. On one hand, applications like Hadoop MapReduce generate throughput sensitive traffic in the shuffle phase. On the other hand, applications like Memcached generate latency sensitive traffic. All of this traffic is uncoordinated, which leads to poor performance for all the flows involved. We have job trackers and other central managers in the datacenter which coordinate jobs for better performance, but the network itself is treated as a black box, and this lack of coordination hurts the performance of the applications.
3
Network is treated as a black-box
Applications like Hadoop MapReduce perform inefficiently Applications like Memcached experience high latency Why does the lack of coordination hurt performance? Since the network is treated as a black box, applications like MapReduce perform inefficiently and applications like Memcached experience high latency for their requests. Question: how does the lack of coordination manifest itself?
4
Example datacenter scenario
Bulk transfer Latency sensitive Traffic receiver Answer: To see how this could happen, let us consider a simplified datacenter example with 4 hosts and 3 switches. Two nodes, shown here, perform bulk transfers, similar to the transfers seen in the MapReduce shuffle phase. One node sends latency sensitive traffic (similar to the short flows seen in the case of Memcached) across the same network.
5
Drops and queuing lead to poor performance
Traffic receiver Bulk transfer traffic experiences packet drops Latency sensitive traffic gets queued in the buffers Let us start with the bulk transfers. The flows fill the switch buffers, and because of the way TCP shares a link, the flows experience packet losses that lead to lower throughput. At the same time, the flow that wants low latency sees high latency because its packets get queued in the switch buffers. Takeaway: since the traffic is not coordinated, bulk transfers like MapReduce flows achieve lower throughput and latency sensitive transfers like Memcached requests experience higher latency.
6
Current solutions do not take a holistic approach
Facebook uses a custom UDP based transport protocol Alternative transport protocols like DCTCP address TCP shortcomings Infiniband, Myrinet offer boutique hardware solutions to address these problems but are expensive Since the demand can be anticipated, can we coordinate hosts? Since this is an important problem, several solutions have been proposed. To name a few: Facebook, whose architecture depends heavily on Memcached and requires low latency, reportedly uses a custom UDP based transport protocol to achieve it; alternative transport protocols like DCTCP try to address the shortcomings of TCP; and since none of these addresses the whole problem, vendors offer boutique hardware such as Infiniband and Myrinet, which is expensive because it is a general solution. The datacenter, on the other hand, is unique: demand can be anticipated. Takeaway: current solutions do not take a holistic look at the problem. Observation: if the flows can be coordinated, then all of these problems can be solved, and many datacenter traffic patterns allow us to predict the demand, so the traffic can indeed be coordinated.
7
Taking turns to transmit packets
Receiver Time Division Multiple Access If hosts take turns, the communication pattern would look something like this: first the node sending the red packets gets a chance to send, then the node sending the green packets gets a chance to send. Finally, when the latency sensitive traffic is sent, the buffers are not occupied, so it goes right through without any delay; we might as well not have any buffers. This idea of taking turns is similar to what happens at a traffic light, and in the context of networks it has a technical name: Time Division Multiple Access, or TDMA. Takeaway: this is TDMA, an age old idea.
8
TDMA: An old technique Question: TDMA is an old technique, going back to the 1940s, so why is it not used in the datacenter? Break!
9
Enforcing TDMA is difficult
It is not practical to task hosts with keeping track of time and controlling transmissions End host clocks quickly go out of synchronization Answer: because it is tough to do so. TDMA requires hosts to send data when it is their turn and only when it is their turn. To do this, hosts need to keep track of time and transmit only in their turn, but it is not practical to task hosts with keeping track of time and controlling transmissions. To show this challenge, we did an experiment. As the graph shows, in just about 10 seconds host clocks can diverge by about 2000μs, which is roughly the time it takes to send 300 9KB packets. Takeaway: absent special timing hardware, end hosts cannot be tasked with controlling their transmissions.
10
Existing TDMA solutions need special support
Since end host clocks cannot be synchronized, special support is needed from the network FTT-Ethernet, RTL-TEP, TT-Ethernet require modified switching hardware Even with special support, the hosts need to run real time operating systems to enforce TDMA Existing TDMA solutions such as FTT-Ethernet and RTL-TEP employ modified switching hardware to overcome the problem of unsynchronized clocks. Even with this special support from the network, the hosts need to be able to respond to the control packets they receive, so some solutions also require real time operating systems. Takeaway: doing TDMA is difficult and special support is needed. Question: it sounds like TDMA could solve a good chunk of our problems, but implementing it seems tough. Can we do TDMA with the commodity Ethernet that dominates datacenters, without special real time operating systems? Break! Can we do TDMA with commodity Ethernet?
11
TDMA using Pause Frames
Flow control packets (pause frames) can be used to control Ethernet transmissions Pause frames are processed in hardware Very efficient processing of the flow control packets Blast UDP packets 802.3x Pause frames Answer: It turns out the Ethernet standard has something that might help us here: a signaling mechanism called flow control, which defines pause frames that can be used to control end host transmissions. Takeaway: so there is a solution, use pause frames. Question: how good are they? To find out, we blast UDP packets from a sender and measure the time it takes the sender to react to 802.3x pause frames.
12
TDMA using Pause Frames
Pause frames processed in hardware Very efficient processing of the flow control packets Reaction time to pause frames is 2 – 6 μs Low variance In this graph we show the reaction time to every pause frame: we send a pause frame and measure the time it takes the host to stop transmitting, shown on the y-axis. Because the protocol requires that these control packets be handled very efficiently, pause frames are processed in hardware, with very low variance. This is a pretty good result: on a 10GE link, including the propagation delay, the end host reacts to a pause frame in about the time it takes to send one packet. Maybe we do not need boutique hardware, since such hardware support already exists. Question: so how can we use this signaling mechanism? * Measurement done using 802.3x pause frames
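For reference, here is a minimal sketch (not the paper's measurement code) of how a standard 802.3x pause frame is laid out on the wire and how it could be injected from a Linux host over a raw socket. The interface name and source MAC below are placeholders, and whether a particular NIC will actually transmit a software-generated MAC control frame is hardware dependent.

    # Sketch: construct and send an 802.3x PAUSE frame (assumes Linux, root
    # privileges, placeholder interface "eth0" and placeholder source MAC).
    import socket
    import struct

    MAC_CONTROL_DST = bytes.fromhex("0180c2000001")   # reserved MAC-control multicast address
    ETHERTYPE_MAC_CONTROL = 0x8808                    # MAC Control EtherType
    OPCODE_PAUSE = 0x0001                             # 802.3x PAUSE opcode

    def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
        """quanta is the pause time in units of 512 bit times; 0 means resume."""
        header = MAC_CONTROL_DST + src_mac
        body = struct.pack("!HHH", ETHERTYPE_MAC_CONTROL, OPCODE_PAUSE, quanta)
        return (header + body).ljust(60, b"\x00")     # pad to the minimum frame size

    if __name__ == "__main__":
        sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
        sock.bind(("eth0", 0))                        # placeholder interface name
        src_mac = bytes.fromhex("020000000001")       # placeholder locally administered MAC
        sock.send(build_pause_frame(src_mac, quanta=0xFFFF))  # ask the peer to stop sending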
13
TDMA using commodity hardware
Collect demand information from the end hosts Compute the schedule for communication Control end host transmissions TDMA imposed over Ethernet using a centralized fabric manager Now that we have a signaling mechanism for controlling end host transmissions, how can the hosts take turns communicating? We propose using a centralized fabric manager to control the transmissions. The fabric manager collects demand information from the end hosts, computes a schedule for communication which determines which host transmits when and to whom, and then controls the end host transmissions to enforce that schedule; this process repeats itself. The focus of this talk is on whether end host transmissions can be controlled. Observation: let me show a simple example of how such a system would work.
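To make the division of labor concrete, here is a minimal sketch of such a fabric manager loop. The functions collect_demand, compute_schedule, and control_round are placeholder names standing in for the three steps on the slide; they are not the paper's implementation.

    # Sketch of the fabric manager's outer loop; all names are placeholders.
    def collect_demand(hosts):
        # Placeholder: gather (source, destination) pairs with pending traffic.
        return [(s, d) for s in hosts for d in hosts if s != d]

    def compute_schedule(demand):
        # Placeholder: group the demand into rounds so that, within a round,
        # no two transmissions share a link.
        return [[pair] for pair in demand]

    def control_round(hosts, round_pairs):
        # Placeholder: signal the hosts so that only the sources listed in
        # round_pairs transmit during the next TDMA slot.
        pass

    def fabric_manager(hosts):
        while True:
            demand = collect_demand(hosts)            # 1. collect demand information
            schedule = compute_schedule(demand)       # 2. compute the communication schedule
            for round_pairs in schedule:              # 3. control end host transmissions
                control_round(hosts, round_pairs)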
14
TDMA example Collect demand information from the end hosts D1
Compute the schedule for communication S D2 S –> D1: 1MB S –> D2: 1MB Control end host transmissions Answer: Consider this simple example which illustrates how the system would work. We first collect demand from the end hosts, then compute a schedule for communication, and finally begin to control the end host transmissions. For now, suppose that the end host is also aware of the communication schedule; then we only have to signal the end host to take it through the rounds of communication. I will describe later how the signal itself can carry the schedule information. Question: As you can see, this is relatively simple with one end host. How would this work if there are multiple end hosts? Schedule: round1: S -> D1 round2: S -> D2 round3: S -> D1 round4: S -> D2 … Fabric manager
15
More than one host Fabric manager
Control packets should be processed with low variance Control packets should arrive at the end hosts synchronously Answer: If we increase the number of hosts that the fabric manager has to control, then we have the following scenario: the fabric manager first sends the control packets beginning round 1 to all the hosts, then the control packets beginning round 2, and so on. In order for all the hosts to respond synchronously, the control packets must be processed with low variance and must arrive at the end hosts synchronously. We have already seen that the control packets are processed with low variance. Question: Can the control packets arrive at the end hosts synchronously? Break!
16
Synchronized arrival of control packets
We cannot directly measure the synchronous arrival Difference in arrival of a pair of control packets at 24 hosts To answer this question, ideally we would record the time at which each host receives the control packet of a round and compare the timestamps, which would all be the same if arrival were synchronous. But remember, the end host clocks are not synchronized. So instead we measure the difference in arrival times of a pair of control packets at each host.
17
Synchronized arrival of control packets
Difference in arrival of a pair of control packets at 24 hosts Variation of ~15μs for different sending rates at end hosts Each line in the graph represents the difference in arrival of a pair of control packets at the 24 hosts, for a different transmission rate at the end hosts. The variation is approximately 15μs, so when we send the control packets the end hosts can be unsynchronized for about 15μs. We therefore choose a guard time of 15μs between TDMA slots. Answer: so the control packets do not arrive at the end hosts exactly synchronously, but we have a very tight bound; the margin of error is 15μs. How does this affect the design of our system?
18
Ideal scenario: control packets arrive synchronously
Host A: Round 1 Round 2 Round 3 Host B: Round 1 Round 2 Round 3 Ideally, all the hosts would receive the control packets at the same time, which would ensure that the hosts stay synchronized.
19
Experiments show that packets do not arrive synchronously
Host A: Round 1 Round 2 Round 3 Host B: Round 1 Round 2 Round 3 Out of sync by <15μs But as our experiments show, the control packets do not arrive at the same time; there is a margin of error of about 15μs. To see how this affects the design of the system, consider two hosts A and B that are currently in round 1. When we send the control packets for round 2, they do not arrive synchronously, so for up to about 15μs host A is still in round 1 while host B is already in round 2, violating TDMA. The same happens when the round 3 control packets are sent. Takeaway: this is a problem because host A in round 1 and host B in round 2 could be sending to the same destination, which would cause packet drops and violate TDMA. Question: How can we address this?
20
Guard times to handle lack of synchronization
Host A: Round 1 Stop Round 2 Round 3 Host B: Round 1 Stop Round 2 Round 3 Guard times (15μs) handle out of sync control packets Answer: To address this issue, we use a standard technique from TDMA based solutions: guard times. After round 1, we send another control packet to both hosts stopping all traffic, and only after some time do we send the control packets beginning round 2. The period in which the hosts are not transmitting any data is the guard time, and it ensures that there are no overlaps between TDMA rounds.
21
TDMA for Datacenter Ethernet
Use flow control packets to achieve low variance Guard times adjust for variance in control packet arrival Control end host transmissions To summarize, to control the end hosts we use flow control packets to control their transmissions and introduce guard times to handle any overlap between TDMA rounds. Break!
22
Encoding scheduling information
We use IEEE 802.1Qbb priority flow control frames to encode scheduling information Using iptables rules, traffic for different destinations can be classified into different Ethernet classes 802.1Qbb priority flow control frames can then be used to selectively start transmission of packets to a destination Until now I have not said how the scheduling information is encoded into the control packets. One way would be to send the schedule ahead of time; a more efficient way is to carry the schedule information in the control frames themselves. We use 802.1Qbb priority flow control frames, which allow us to selectively pause or un-pause traffic: we classify the traffic for different destinations into different Ethernet classes, and when traffic to a certain destination has to be allowed, we un-pause the corresponding class. Question: how does all this work?
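As an illustration of how a schedule can be folded into the control frame itself, here is a minimal sketch of an 802.1Qbb priority flow control frame that pauses every traffic class except the one assumed to carry packets for the scheduled destination. The mapping of destinations to priority classes and the source MAC are assumptions; the per-class time fields follow the standard PFC layout.

    # Sketch: encode "transmit only to the scheduled destination" in an
    # 802.1Qbb priority flow control (PFC) frame. Traffic to each destination
    # is assumed to be mapped to its own priority class on the sender.
    import struct

    MAC_CONTROL_DST = bytes.fromhex("0180c2000001")
    ETHERTYPE_MAC_CONTROL = 0x8808
    OPCODE_PFC = 0x0101
    MAX_QUANTA = 0xFFFF

    def build_pfc_frame(src_mac: bytes, allowed_class: int) -> bytes:
        """Pause all eight priority classes except `allowed_class`, which is resumed."""
        enable_vector = 0b11111111                    # a time value is supplied for every class
        quanta = [0 if c == allowed_class else MAX_QUANTA for c in range(8)]
        frame = MAC_CONTROL_DST + src_mac
        frame += struct.pack("!HHH", ETHERTYPE_MAC_CONTROL, OPCODE_PFC, enable_vector)
        frame += struct.pack("!8H", *quanta)          # per-class pause times, class 0 first
        return frame.ljust(60, b"\x00")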
23
Methodology to enforce TDMA slots
Pause all traffic Un-pause traffic to a particular destination Pause all traffic to begin the guard time Answer: We do the following: stop all the traffic that a host is sending, then allow traffic to one destination. Once the TDMA slot is done, we stop all the traffic again. The time between pausing all traffic and un-pausing traffic to a destination is the guard time. Takeaway: we can control end hosts so that they transmit packets to a particular destination in each round. Break!
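Putting the timing together, one TDMA slot as enforced by the fabric manager might look like the following sketch. Here send_pfc is a placeholder for transmitting a priority flow control frame such as the one sketched earlier, the slot and guard lengths are taken from the experimental setup described later, and whether the guard time is counted inside the slot is an accounting assumption.

    # Sketch of one TDMA slot: pause everything, wait out the guard time,
    # then un-pause only the class carrying traffic to this round's destination.
    import time

    SLOT_SECONDS = 300e-6      # TDMA slot length (from the experimental setup)
    GUARD_SECONDS = 15e-6      # guard time covering control-packet arrival skew
    ALL_PAUSED = [0xFFFF] * 8  # pause quanta for every priority class

    def send_pfc(host, per_class_quanta):
        # Placeholder: build and send an 802.1Qbb frame to `host`.
        pass

    def run_slot(hosts, sender, dest_class):
        for host in hosts:
            send_pfc(host, ALL_PAUSED)                # stop all traffic: guard time begins
        time.sleep(GUARD_SECONDS)
        resumed = list(ALL_PAUSED)
        resumed[dest_class] = 0                       # a pause time of zero resumes the class
        send_pfc(sender, resumed)                     # un-pause traffic to one destination
        time.sleep(SLOT_SECONDS - GUARD_SECONDS)      # remainder of the slot; note that
                                                      # time.sleep is far coarser than microseconds,
                                                      # so a real manager needs tighter timing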
24
Evaluation MapReduce shuffle phase Memcached like workloads
All to all transfer Memcached like workloads Latency between nodes in a mixed environment in presence of background flows Hybrid electrical and optical switch architectures Performance in dynamic network topologies Now that we know how to enforce TDMA, does everything just work? We demonstrate that TDMA can indeed improve performance in several application scenarios. To see the performance gain that bulk transfers can achieve, we emulate a MapReduce shuffle phase with an all to all transfer. To show the reduction in latency for applications like Memcached in a mixed environment, we measure the latency between nodes in the presence of background flows. Finally, we note that the nature of TDMA makes it well suited to the latest datacenter architectures, which employ a mix of electrical and optical switches that dynamically redirect flows over links of varying capacities.
25
Experimental setup 24 servers 1 Cisco Nexus 5000 series 10G
HP DL380 servers with dual Myricom 10G NICs, using kernel bypass to access packets 1 Cisco Nexus 5000 series 10G 96-port switch, 1 Cisco Nexus 5000 series 10G 52-port switch 300μs TDMA slot and 15μs guard time, for an effective 5% overhead In a 300μs slot a host sends a little less than 38KB of data.
26
All to all transfer in multi-hop topology
10GB all to all transfer As mentioned, to emulate a MapReduce all to all shuffle phase, we perform an all to all transfer among 24 hosts connected as shown, in three groups of 8 hosts, each group attached to an edge switch. We connect the fabric manager to each of the edge switches so that it is equidistant from the end hosts and the control packets can arrive in a synchronized manner.
27
All to all transfer in multi-hop topology
10GB all to all transfer We use a simple round robin scheduler at each level 5% inefficiency owing to guard time Ideal transfer time: 1024s TCP all to all TDMA all to all With TCP, performance looks as shown in the graph, where each line indicates the progress of one flow. Some flows finish quicker because they are intra-switch, but the overall transfer time is far from ideal. Symptom: this happens because TCP is left to coordinate the use of shared links that carry many flows (128 on the most contended link), and prior work shows that when many flows share a bottleneck link, TCP does not share it fairly. Cure: if instead the flows are coordinated so that at any given point only one flow is on a link, we get near ideal performance; we also do not use TCP, because there is only one flow per link. With TDMA we are almost ideal, and the 5% inefficiency visible in the graph is due to the guard time.
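The "simple round robin scheduler" mentioned above can be illustrated by the classic rotation in which host i sends to host (i + r) mod N in round r, so every source-destination pair gets exactly one slot per cycle and no two senders share a destination within a round. This is an illustrative sketch, not necessarily the exact schedule used in the paper.

    # Sketch: a round-robin all-to-all schedule over N hosts.
    def round_robin_schedule(n_hosts):
        rounds = []
        for r in range(1, n_hosts):                   # n_hosts - 1 rounds per full cycle
            rounds.append([(i, (i + r) % n_hosts) for i in range(n_hosts)])
        return rounds

    # Example: with 4 hosts, round 1 is [(0, 1), (1, 2), (2, 3), (3, 0)]; after
    # 3 rounds every host has had one dedicated slot to every other host.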
28
Latency in the presence of background flows
Start both bulk transfers Measure latency between nodes using UDP Bulk transfer Latency sensitive Receiver Bulk transfer Now let us return to the example from the beginning of this talk, in which two nodes send bulk traffic and one node sends latency sensitive traffic, and the latency sensitive traffic gets queued in the switch buffers, leading to higher latency. To show the improvement that a TDMA based system can achieve, we start both bulk transfers and then measure the latency between the nodes using UDP. With TDMA, since each flow sends in its own dedicated slot, the buffers are always empty.
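For concreteness, a minimal sketch of such a UDP latency probe is shown below; it assumes the remote node runs a matching echo responder, and the port number, probe size, and probe count are placeholders rather than the paper's parameters.

    # Sketch: measure request/response latency to a remote node over UDP.
    import socket
    import time

    def median_latency_us(remote_ip, port=12345, probes=1000):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(1.0)
        samples = []
        for _ in range(probes):
            start = time.monotonic()
            sock.sendto(b"x" * 64, (remote_ip, port))  # small latency-sensitive request
            sock.recvfrom(2048)                        # wait for the echo reply
            samples.append((time.monotonic() - start) * 1e6)
        samples.sort()
        return samples[len(samples) // 2]              # median latency in microseconds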
29
Latency in the presence of background flows
Latency between the nodes in presence of TCP flows is high and variable TDMA system achieves lower latency TCP TDMA TDMA with kernel bypass Symptom: as we saw earlier, in the presence of large background flows the latency sensitive packets get queued in the switch buffers, leading to poor performance. Cure: TDMA keeps the buffers empty, so we achieve much lower latency.
30
Adapting to dynamic network configurations
Optical circuit switch Electrical packet switch More recently, new network architectures have been proposed which use a mix of electrical and optical switches. The optical switch offers substantially higher bandwidth when a flow is scheduled to use it; when a flow uses the electrical path it gets lower bandwidth. Because the optical switching time is high, once a flow is scheduled onto the optical path it is allowed to use the link for a good amount of time. As better switches are produced and the switching time falls, you may want to switch flows at a much faster rate.
31
Adapting to dynamic network configurations
Link capacity between the hosts is varied between 10Gbps and 1Gbps every 10ms Receiver Sender Ideal performance
32
Adapting to dynamic network configurations
Link capacity between the hosts is varied between 10Gbps and 1Gbps every 10ms Receiver Sender TCP performance Symptom: TCP performance here is poor for two reasons: 1) when the flow is switched from the 10G link to the 1G link, packet losses occur, and 2) when it is switched from 1G back to 10G, TCP takes a long time to adapt.
33
Adapting to dynamic network configurations
TDMA better suited since it prevents packet losses Cure: make sure losses never happen, so that TCP does not have to adapt frequently. The scheduler that we use gives slots such that … TCP performance
34
Conclusion TDMA can be achieved using commodity hardware
Leverage existing Ethernet standards TDMA can lead to performance gains in current networks 15% shorter finish times for all to all transfers 3x lower latency TDMA is well positioned for emerging network architectures which use dynamic topologies 2.5x throughput improvement in dynamic network settings
35
Thank You