Achieving reliable high performance in LFNs (long-fat networks)
Sven Ubik, Pavel Cimbál, CESNET

I would like to talk about some issues that need to be considered when we want to reliably achieve high performance over long-distance high-speed networks, or long-fat networks. They are named so because of the potentially large volume of outstanding data, sent by the sender but not yet acknowledged by the receiver, that fits into the network and that can cause performance problems.
End-to-end performance
E2E performance is a result of the interaction of all computer system components: network, network adapter, communication protocols, operating system, applications
Decided to concentrate on E2E performance on Linux PCs
Primary problem areas:
- TCP window configuration
- OS / network adapter interaction
- ssh / TCP interaction

It turns out that the performance observed by the user between two end hosts depends not only on the characteristics of the network; it is a result of interactions between all components along the communication path, beginning with the application, over the operating system, communication protocols and the network adapters, down to the network itself. If any of these components is slow or inappropriately configured, or if there is a problem in the interaction between some components, the resulting performance can be much lower than what you would expect from the physical speed of the network.

Many of the advanced Internet users who require high performance are now using the Linux operating system. We therefore decided to concentrate in the first phase on end-to-end performance when using Linux. Given that, there seem to be three primary problem areas:
- TCP window configuration - how big the windows should be and how we configure them
- the interaction between the OS and the network adapter
- the interaction between ssh, the popular protocol for secure data transfer, and TCP

Let's first look at TCP buffer configuration. You probably know that to achieve good performance over a high-speed long-distance network you need to increase the size of the TCP buffers from the usual default of 64 kB, because the size of the TCP buffers limits the volume of data that can be transferred in one RTT. You can increase the TCP buffers manually for one application, you can change the default for all applications, or you can use an autoconfiguration tool such as DRS or WAD.
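To make the first of these options concrete, here is a minimal C sketch (our own illustration, not from the slides) of how an application can request larger TCP buffers for a single connection with setsockopt(); the 4 MB value is only an example and the right size depends on the bandwidth-delay product of the path.

/*
 * A minimal sketch: request larger TCP buffers for one connection.
 * The 4 MB value is only an example.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int bufsize = 4 * 1024 * 1024;          /* requested buffer size in bytes */

    if (s < 0) {
        perror("socket");
        return 1;
    }
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_RCVBUF");

    /* ... connect() and transfer data as usual ... */
    return 0;
}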
TCP window configuration
Linux 2.4: how big are my TCP windows?

Whatever method you choose, one of two things happens: either the sysctl command is executed to change the OS defaults, or a setsockopt() call is executed to change the parameters of a particular connection. However, you should consider that the Linux kernel applies some internal arithmetic to the values that you supply, which results in TCP buffer sizes, and consequently TCP windows, different from what you specified. You might have noticed that when you specify, say, a 1 MB window when using iperf to measure throughput, iperf reports that it uses a 2 MB window. The reason is that the value supplied to the setsockopt() call is internally doubled, and that doubled value is returned by getsockopt(). But what iperf reports is still not right, because Linux takes part of this doubled value away as an application buffer and only the rest is used for the TCP window. And this TCP window is even dynamically moderated during the connection to prevent sudden bursts from the sender. These features can be configured with system variables and should be considered when we want to make performance measurements with reproducible results.
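A small illustrative sketch (again our own, not from the slides) of how to observe this: request a 1 MB send buffer and read back what the kernel actually granted with getsockopt(); on Linux the reported value is expected to be roughly double the request, as described above.

/*
 * Illustrative sketch: request 1 MB and read back what the kernel granted.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int requested = 1024 * 1024;            /* 1 MB requested */
    int granted = 0;
    socklen_t len = sizeof(granted);

    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested));
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &granted, &len);

    printf("requested %d bytes, kernel reports %d bytes\n", requested, granted);
    return 0;
}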
Throughput with large TCP windows
Interaction with the network adapter must be considered:
- a large TCP sender window allows large chunks of data to be submitted from IP through the txqueue to the adapter
- full txqueue -> send_stall() to the application and a context switch
- no problem as long as the txqueue is large enough for a timeslice

Given that we specify the TCP windows that we want, we should consider the interaction between the IP layer and the NIC. When an application has a lot of data to send and the TCP sender has a large available window, it can submit a large chunk of data at once to the network adapter. But the network adapter can send packets only at the physical network speed, say 1 Gb/s. Therefore there is a transmission queue in front of the network adapter to store packets submitted to be sent. When this buffer fills, the application's send is stalled and the OS switches context to another process. This is no problem at all as long as the buffer is large enough to supply the network adapter with fresh packets until the context is switched back to the sending application. With the standard Linux system timer this can take up to 10 ms. You can easily calculate that to saturate a Gigabit Ethernet adapter for 10 ms with 1500-byte packets (we do not need to consider smaller packets, because large volumes of data are normally sent in large packets) we need a buffer of at least 833 packets. Therefore it is a good idea to use the ifconfig command with the txqueuelen parameter and increase the transmission queue from the default value of 100 to, say, 1000 packets.

For a Gigabit Ethernet adapter and the standard Linux system timer:
txqueuelen > 1 Gb/s * 10 ms / 8 bits / 1500 bytes = 833 packets
ifconfig eth0 txqueuelen 1000
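The same arithmetic as the formula above, wrapped in a small helper (illustrative only) so it can be repeated for other link speeds, timer intervals or packet sizes:

/* Minimum transmission queue length to bridge one system-timer interval. */
#include <stdio.h>

static double min_txqueuelen(double link_bps, double timer_s, double pkt_bytes)
{
    /* packets needed to keep the adapter busy for one timer interval */
    return link_bps * timer_s / (8.0 * pkt_bytes);
}

int main(void)
{
    /* 1 Gb/s adapter, 10 ms system timer, 1500-byte packets -> ~833 packets */
    printf("minimum txqueuelen: %.0f packets\n",
           min_txqueuelen(1e9, 0.010, 1500.0));
    return 0;
}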
Throughput with large TCP windows, cont.
This picture illustrates the throughput achieved with different TCP windows over one of our testing paths, from CESNET to Uninett, in several rounds of measurements. You can see that with the default value of the transmission queue, throughput began to decrease with window sizes over 2 MB. An increased transmission queue improved throughput significantly, but with really large windows it began to decrease again. There are several reasons for this, but the most important one is overfilling of router queues.
Using “buffered pipe” is not good
Router queues must be considered:
- no increase in throughput over using the "wire pipe"
- self-clocking adjusts the sender to the bottleneck speed, but does not stop the sender from accumulating data in queues
- filled-up queues are sensitive to losses caused by cross-traffic
- check throughput (TCP Vegas) or RTT increase?

rwnd <= pipe capacity: bw = rwnd / RTT
rwnd > pipe capacity: bw ~ (MSS / RTT) * 1 / sqrt(p)

Figure: RTT development during a connection - flat lower bound at RTT = 45 ms, fluctuations up to RTT = 110 ms, bottleneck installed bandwidth 1 Gb/s, buffer content ~8 MB.

TCP is self-clocking in that the sender can send packets only at the rate at which acknowledgements are coming, and in that way TCP adjusts to the bottleneck speed. This however does not prevent the sender from still increasing its congestion window using the AIMD algorithm and thus increasing the volume of outstanding data that fills up router queues. All TCP connections do that, and the filled router queues become sensitive to overflow due to fluctuations in user traffic. At some point losses occur and senders are forced to slow down. We can assume that if we manage to stop increasing the congestion window when router queues are beginning to fill, we can prevent losses and increase throughput. TCP Vegas and several other works tried to detect when throughput stopped increasing, or did not correspond to the volume of outstanding data, in order to stop the window increase. But it turns out that TCP Vegas has problems with fairness and sometimes achieves even lower throughput than standard Reno TCP under the same network conditions.

This picture shows the RTT development during one connection. The lower bound was given by the RTT on an empty network and was about 45 ms. The fluctuations that peaked at about 110 ms were caused by fluctuations of data in router queues. At this point a loss occurred and the sender had to slow down. It seems that checking the RTT development can also be used to better moderate the sender TCP window to prevent losses. We are currently working on a modified TCP that will have this functionality. It is not yet complete.
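As an illustration only, the following C sketch captures the general idea of RTT-based window moderation described above; it is not our modified TCP (which is not yet complete), and the names and the 20% threshold are assumptions made for the example.

#include <stdbool.h>
#include <stdio.h>

struct conn_state {
    double base_rtt;    /* lowest RTT seen, approximates the empty-network RTT */
    double cwnd;        /* congestion window in bytes */
    double mss;         /* maximum segment size in bytes */
};

/* Called once per RTT sample; returns true if cwnd growth should be frozen. */
static bool moderate_cwnd(struct conn_state *c, double rtt_sample)
{
    const double queue_factor = 1.2;        /* assumed: 20% above the base RTT */

    if (rtt_sample < c->base_rtt)
        c->base_rtt = rtt_sample;
    if (rtt_sample > queue_factor * c->base_rtt)
        return true;                        /* queues are filling: stop growing */

    c->cwnd += c->mss;                      /* additive increase, ~1 MSS per RTT */
    return false;
}

int main(void)
{
    struct conn_state c = { .base_rtt = 0.045, .cwnd = 64 * 1024, .mss = 1460 };
    double samples[] = { 0.045, 0.046, 0.060, 0.090 };   /* RTT samples in seconds */

    for (int i = 0; i < 4; i++)
        printf("rtt=%.3f s  freeze=%d  cwnd=%.0f\n",
               samples[i], moderate_cwnd(&c, samples[i]), c.cwnd);
    return 0;
}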
Other configuration problems
The TCP cache must be considered. (Figures: owin development, cwnd development.)

But what I can show is the effect of another performance problem. TCP has a cache that remembers some connection parameters for individual destinations for 10 minutes. One of these parameters is ssthresh, the threshold between the slow-start and congestion-avoidance phases of a connection. When one connection goes wrong and ssthresh is significantly decreased due to multiple losses, then all subsequent connections to the same destination begin with the same low ssthresh. This picture shows the development of outstanding data. You can see that at this point the exponential increase of the slow-start phase ended suddenly and then the window was increased only very slowly in congestion avoidance. Throughput was locked at the value given by this small window. This is basically the same situation, illustrated by the congestion window development obtained from web100. The TCP cache can be flushed by writing 1 to this system variable:

initial ssthresh locked at 1.45 MB
echo 1 > /proc/sys/net/ipv4/route/flush
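The same flush can of course also be done from a program; a minimal C sketch (illustrative only, and writing to the /proc path shown above normally requires root):

/* Drop the cached per-destination metrics, including the remembered ssthresh. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/route/flush", "w");

    if (f == NULL) {
        perror("open route flush");
        return 1;
    }
    fputs("1\n", f);
    fclose(f);
    return 0;
}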
Bandwidth measurement and estimation
Test paths: cesnet.cz <--> uninett.no, switch.ch

pathload over one week:
- 27% of measurements too low (50-70 Mb/s)
- 7% of measurements too high (1000 Mb/s)
- 66% of measurements realistic ( Mb/s), but the range was sometimes too wide (150 Mb/s)
pathrate: lots of fluctuations
UDP iperf: can stress existing traffic
TCP iperf: more fluctuations for larger TCP windows

So it appears that until we have a reliable window autoconfiguration that will not only make the windows large enough, but also prevent them from growing too large, we can achieve the best throughput over a particular path if we measure or estimate the available bandwidth of that path and set the TCP windows accordingly. If we do not want to stress the network with test traffic, we can try pathload, which is currently the only publicly available tool for available-bandwidth estimation. We did quite a lot of tests with pathload over different network paths and our experience is that its results can normally be divided into three categories:
- in 27% of cases the indicated bandwidth was unrealistically low
- in 7% it was unrealistically high
- in the remaining cases it looked realistic, but sometimes the indicated range was too wide
We can do more measurements to increase the probability of getting a result with some precision, but this can take too long for practical purposes. We also tried pathrate, which should indicate the installed bottleneck bandwidth, and we found that it exhibits a lot of fluctuations as well. For comparison we did measurements with iperf with different window sizes and numbers of streams. As expected, large TCP windows caused a lot of fluctuations between different measurements due to different loss patterns.
Bandwidth measurement and estimation, cont.
cesnet.cz -> switch.ch

This picture shows measurements on one of our testing paths, from CESNET to SWITCH. The green line is pathrate; as you see, there are lots of fluctuations, so it is not very useful. This line is UDP iperf, hovering at about 950 Mb/s. The flat sections were probably limited only by the end hosts. It is not clear whether there really was that much available bandwidth; it could easily be that other traffic, which was mostly TCP, was stressed by this UDP. These lines are TCP iperf with different windows, and the red line is pathload. It is somewhere between the UDP and TCP iperf results, which could be realistic, but probably more as a matter of luck than a systematic property.
Bandwidth measurement and estimation, cont.
uninett.no -> cesnet.cz

This picture shows results from another testing path, from Uninett to CESNET. Pathload is again shown by the red line. For a reason which is not quite clear, it reported unrealistically low values the whole time. Another point, which may not be clearly visible in this picture, is that iperf with larger windows exhibited more fluctuations than with smaller windows. We repeated the test with each particular window size several times and the range is indicated by a vertical bar. Our experience is that when iperf throughput begins to fluctuate between measurements with some window size, it is probably a good indication that we have approached the available bandwidth and are causing losses due to buffer overflow.
ssh performance
Cesnet -> Uninett, 1.5 MB window, 10.4 Mb/s, 9% CPU load

OK, it is nice when we manage to configure TCP windows to achieve high throughput with iperf. But people are not transferring their files with iperf. One of the most popular file transfer tools is scp, which uses the ssh protocol for encryption and authentication. It turns out that the ssh implementation has a significant performance problem. We set the TCP window to 1.5 MB on our testing path from CESNET to Uninett and achieved a very low throughput of 10 Mb/s. This picture shows the development of sequence numbers obtained from tcptrace and xplot. You can see that the sequence numbers are kept close to the acknowledgement numbers, not using the receiver advertised window. Whenever scp is slow, many people tend to blame high CPU load due to encryption and decryption. Well, we checked the CPU load and it was about 9%. So what was the problem?

sequence number development - rwin not utilised
ssh performance, cont.

For encryption purposes, the ssh algorithm packs data into blocks, each 32 kB by default. Delivery of these blocks is acknowledged at the application level, and for these acknowledgements ssh maintains its own internal window above the TCP window. The default size of the internal window is 4 blocks, so the maximum volume of outstanding data is 128 kB, which is the limit on the data that can be transferred in one RTT.
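To put a rough number on this limit for our test path (RTT about 45 ms, as shown on the next slide):

max ssh throughput <= internal window / RTT = 128 kB * 8 / 45 ms ≈ 23 Mb/s

which is the right order of magnitude for the roughly 10 Mb/s we actually observed; the remaining difference presumably comes from the timing of the application-level acknowledgements and other overheads.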
ssh performance, cont.
BW = 1 Gb/s, RTT = 45 ms, TCP window = 8 MB, Xeon 2.4 GHz

We patched ssh so that the size of this internal window can be adjusted. This picture shows the achieved throughput and the corresponding CPU load depending on the size of the internal ssh window. Throughput with the default ssh was 10 Mb/s. With an increased internal window, we achieved throughput of up to 180 Mb/s over the same network path with the same data. Obviously, CPU load increased correspondingly. But the CPU load was not the cause of the low initial throughput; that was caused by the small ssh internal window.
ssh performance, cont.
CHAN_SES_WINDOW_DEFAULT = 40 blocks of 32 kB, 85% CPU load

And this picture shows the sequence number development with the internal window increased to 40 blocks. You can see that the sender is now able to utilize the receiver advertised window.

sequence number development - rwin utilised
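For reference, a sketch of the kind of change this patch makes. In the OpenSSH sources of that time the constants live in channels.h, but the exact file and the original factor may differ between versions, so treat this as illustrative only:

/* channels.h (OpenSSH) - illustrative sketch of the patch */
#define CHAN_SES_PACKET_DEFAULT (32 * 1024)   /* 32 kB ssh data blocks */
/* original: 4 blocks = 128 kB of outstanding data per RTT */
/* #define CHAN_SES_WINDOW_DEFAULT (4 * CHAN_SES_PACKET_DEFAULT) */
/* patched: 40 blocks = 1.25 MB of outstanding data per RTT */
#define CHAN_SES_WINDOW_DEFAULT (40 * CHAN_SES_PACKET_DEFAULT)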
Conclusion
- Large TCP windows require other configuration tasks for good performance
- Buffer autoconfiguration should not just conserve memory and set TCP windows "large enough", but also "small enough"
- ssh has a significant performance problem that must be resolved at the application level
- The influence of OS configuration and implementation-specific features on performance can be stronger than amendments in congestion control

Thank you for your attention.

So to conclude, we can say that large TCP windows also require other configuration tasks for good performance, such as taking care of the transmission queue in front of the network adapter and of the TCP cache. Any kind of buffer and TCP window autoconfiguration should not only conserve memory, by not allocating large buffers for connections that do not need them, and make sure that TCP windows are large enough when needed; it should also moderate the TCP windows during the connection so that router queues are not overfilled. Ssh has a significant performance problem due to its internal window, which must be resolved at the application level. And finally, having seen all this, we can note that the influence of OS configuration and of some implementation-specific features on performance can sometimes be more important than slight improvements in congestion control algorithms proposed in Internet drafts or elsewhere, and therefore we should take care of them.