RDMA vs TCP experiment
Outline: Goal, Environment, Test tool (iperf), Test Suites, Conclusion
Goal
- Measure maximum and average bandwidth usage in 40 Gbps (InfiniBand) and 10 Gbps (iWARP) network environments.
- Compare CPU usage between the TCP and RDMA data transfer modes.
- Compare CPU usage between RDMA READ and RDMA WRITE modes.
Environment
- 40 Gbps InfiniBand
- 10 Gbps iWARP
- Netqos03 (client), Netqos04 (server)
- Open question: is there a switch between the two servers?
Test tool: iperf
- Migrated iperf 2.0.5 to the RDMA environment with OFED (librdmacm and libibverbs): 2000+ source lines of code added, from 8382 to 10562.
- Extended iperf usage:
  -H: RDMA transfer mode instead of TCP/UDP
  -G: pr (passive read) or pw (passive write); data is read from the server, or the server writes into the client (see the sketch below)
  -O: output data file, for both the TCP server and the RDMA server
- Only one stream is transferred.
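The pr/pw modes come down to which one-sided verb the active side posts. Below is a minimal sketch using the libibverbs API, not the actual iperf patch: the queue pair qp, the registered local buffer mr, and the peer's remote_addr/remote_rkey are assumed to have been exchanged already during librdmacm connection setup.

    /* Minimal sketch: post a one-sided RDMA READ or RDMA WRITE work request.
     * Assumes qp, mr, remote_addr and remote_rkey were established beforehand
     * (e.g. through the usual rdma_cm connection setup); not the real iperf code. */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    static int post_one_sided(struct ibv_qp *qp, struct ibv_mr *mr,
                              uint64_t remote_addr, uint32_t remote_rkey,
                              size_t len, int do_read)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr,   /* local registered buffer */
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };

        struct ibv_send_wr wr;
        struct ibv_send_wr *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 1;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        /* The only difference between the two modes is the opcode. */
        wr.opcode     = do_read ? IBV_WR_RDMA_READ : IBV_WR_RDMA_WRITE;
        wr.send_flags = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = remote_rkey;

        return ibv_post_send(qp, &wr, &bad_wr);  /* 0 on success */
    }

Because both verbs are one-sided, the passive peer's CPU does not touch the payload; that is what the RDMA-vs-TCP CPU comparison in the goals is designed to expose.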
Test Suites
- Suite 1: memory -> memory
- Suite 2: file -> memory -> memory
  - Case 2.1: file (regular file) -> memory -> memory
  - Case 2.2: file (/dev/zero) -> memory -> memory
  - Case 2.3: file (Lustre) -> memory -> memory
- Suite 3: memory -> memory -> file
  - Case 3.1: memory -> memory -> file (regular file)
  - Case 3.2: memory -> memory -> file (/dev/null)
  - Case 3.3: memory -> memory -> file (Lustre)
- Suite 4: file -> memory -> memory -> file
  - Case 4.1: file (regular file) -> memory -> memory -> file (regular file)
  - Case 4.2: file (/dev/zero) -> memory -> memory -> file (/dev/null)
  - Case 4.3: file (Lustre) -> memory -> memory -> file (Lustre)
File choice
- File operations use the standard I/O library (fread, fwrite), so data is cached by the OS (a minimal read-side sketch follows this list).
- Input from /dev/zero tests the maximum application data transfer rate including the read file operation, i.e. with the disk removed as the bottleneck.
- Output to /dev/null tests the maximum application data transfer rate including the write file operation, i.e. with the disk removed as the bottleneck.
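A minimal sketch of the read side of the file -> memory stage under the choices above; the /dev/zero source and the 10 MB block are illustrative, and this is not the actual iperf code. Swapping in a regular file or a Lustre path brings the disk/filesystem back into the measurement.

    /* Sketch of the "file -> memory" stage using standard I/O (fread).
     * The source path and block size are illustrative, not from the real patch. */
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK_SIZE (10 * 1024 * 1024)   /* matches the 10 MB RDMA block size */

    int main(void)
    {
        /* /dev/zero takes the disk out of the picture; a regular file or a
         * Lustre path would exercise the disk/filesystem as well. */
        FILE *src = fopen("/dev/zero", "rb");
        if (src == NULL) {
            perror("fopen");
            return 1;
        }

        char *buf = malloc(BLOCK_SIZE);
        if (buf == NULL) {
            fclose(src);
            return 1;
        }

        /* Fill one transfer block; fread goes through the C library
         * buffers and the OS page cache. */
        size_t got = fread(buf, 1, BLOCK_SIZE, src);
        printf("read %zu bytes into the transfer buffer\n", got);

        free(buf);
        fclose(src);
        return 0;
    }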
Buffer choice
- RDMA operation block size is 10 MB per RDMA READ/WRITE (registration sketch below).
- A previous experiment showed that, in this environment, block sizes above 5 MB have little effect on transfer speed.
- TCP read/write buffer size is left at the default TCP window size: 85.3 KByte.
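On the RDMA side, the 10 MB block has to be registered as a memory region before it can be the source or target of READ/WRITE operations. A minimal registration sketch with libibverbs follows; the protection domain pd is assumed to come from the earlier connection setup, and the access flags shown are one common choice rather than necessarily what the iperf patch uses.

    /* Sketch: register a 10 MB transfer buffer for RDMA READ/WRITE.
     * Assumes `pd` (struct ibv_pd *) was obtained during rdma_cm setup. */
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    #define RDMA_BLOCK_SIZE (10 * 1024 * 1024)

    static struct ibv_mr *register_block(struct ibv_pd *pd, void **buf_out)
    {
        void *buf = malloc(RDMA_BLOCK_SIZE);
        if (buf == NULL)
            return NULL;

        /* Allow the remote side to both read from and write into this buffer. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, RDMA_BLOCK_SIZE,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (mr == NULL) {
            free(buf);
            return NULL;
        }

        *buf_out = buf;
        return mr;  /* mr->lkey / mr->rkey are used in work requests */
    }

Registration pins the buffer's pages, so registering one large block once and reusing it keeps per-transfer overhead low, consistent with the observation that block sizes above 5 MB bring little further speedup.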
Test case 1: memory -> memory CPU
Test case 1: memory -> memory Bandwidth. RDMA speed is limited by the PCI Express bus.
Test case 2.1: (fread) file(regular file) -> memory -> memory CPU
Test case 2.1: (fread) file(regular file) -> memory -> memory Bandwidth. Speeds are limited by the disk.
Test case 2.2 (five minutes) file(/dev/zero) -> memory -> memory CPU
Test case 2.2 (five minutes) file(/dev/zero) -> memory -> memory Bandwidth
Test case 3.1 (a 200 GB file is generated): memory -> memory -> file(regular file) CPU. Bandwidths, limited by disk writes, are almost the same.
Test case 3.1 (a 200 GB file is generated): memory -> memory -> file(regular file) Bandwidth. Bandwidths are almost the same!
Test case 3.2: memory -> memory -> file(/dev/null) CPU
Test case 3.2: memory -> memory -> file(/dev/null) Bandwidth
Test case 4.1: file(r) -> memory -> memory -> file(r) CPU
Test case 4.1: file(r) -> memory -> memory -> file(r) Bandwidth
Test case 4.2: file(/dev/zero) -> memory -> memory -> file(/dev/null) CPU
Test case 4.2: file(/dev/zero) -> memory -> memory -> file(/dev/null) Bandwidth
Conclusion
- For a single data transfer stream without disk operations, the RDMA transport is twice as fast as TCP, while its CPU load is only about 10% of the CPU load under TCP.
- FTP consists of two components: networking and file operations. Compared with the RDMA operations, the file operations (limited by disk performance) take most of the CPU usage, so a well-designed file buffering scheme is critical.
Future work
- Set up a Lustre environment and configure Lustre with its RDMA function.
- Start the FTP migration.
- Set up project infrastructure: source control, bug database, documentation, etc. (refer to The Joel Test).
Memory cache cleanup (run between tests so cached file data does not skew results):
# sync                                      (flush dirty pages to disk)
# echo 3 > /proc/sys/vm/drop_caches         (drop page cache, dentries and inodes)