Data Transfer Node Performance GÉANT & AENEAS


Data Transfer Node Performance, GÉANT & AENEAS
Richard Hughes-Jones, 3rd SIG-PMV, Copenhagen, 29 November 2017

Agenda
- Environment: hardware and topology
- Tuning UDP & TCP using the kernel stack
  - Achievable UDP throughput vs packet spacing
  - Achievable UDP throughput vs packet size
  - Achievable TCP throughput and the use of CPU cores, application and NIC ring buffer
  - Achievable TCP throughput with ConnectX-4 and ConnectX-5, multiple flows
- RDMA using the kernel stack
  - RDMA RC: observed protocol and packet size
  - RDMA RC throughput vs message size
  - RDMA RC throughput vs packet spacing
- Kernel bypass with libvma: TCP throughput

The GÉANT DTN Hardware
- A lot of help from Boston Labs (London, UK) and Mellanox (UK & Israel)
- Supermicro X10DRT-i+ motherboard
- Two 6-core 3.40 GHz Xeon E5-2643 v3 processors
- Mellanox ConnectX-4/5 100 GE NIC on 16-lane PCIe, with as many interrupts as cores; driver MLNX_OFED_LINUX-4.0-2.0.0.1
- NVMe SSD set on 8-lane PCIe
- Fedora 23 with the 4.4.6-300.fc23.x86_64 kernel
- (Block diagram: NIC and NVMe attachment; QPI, Intel QuickPath Interconnect, 9.6 GT/s; 16-lane PCIe, 32 GB/s)

DTN Topology 1
- Back to back: active cable, 100GBASE-SR4, 100GBASE-LR4
- DTN qualification in the GÉANT Lab, via two Juniper MX routers

DTN Topology 2
- GÉANT DTNs in London & Paris
- AENEAS DTNs at Jodrell Bank

Network Tuning for 100 Gigabit Ethernet (overview)
- Hyper-threading: turn off
- Wait states: disable
- Power saving / core frequency: set the governor to "performance", set the maximum CPU frequency
- NUMA: select the correct CPU cores
- IRQs: turn off irqbalance; use the correct cores for the NIC IRQs
- Interface parameters: interrupt coalescence, checksum offload, ring buffer
- MTU: set IP MTU to 9000 bytes
- Queues: set txqueuelen and netdev_max_backlog
- Kernel parameters: r(w)mem_max, tcp_r(w)mem, tcp_mem, …
- Firewalls
- Better to choose fewer, higher-speed cores
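A minimal first check for the NUMA item, as a sketch; the interface name enp131s0f0 and node 1 are the ones that appear later in these slides and are assumptions for any other system:

    # Which NUMA node is the NIC attached to? (prints the node number, -1 if unknown)
    cat /sys/class/net/enp131s0f0/device/numa_node
    # Which CPU cores belong to that node?
    numactl -H
    cat /sys/devices/system/node/node1/cpulist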

udpmon: UDP Achievable Throughput
- Ideal shape: flat portions limited by the capacity of the link, or by the available bandwidth on a loaded link; elsewhere the shape follows 1/t
- Packet spacing is the most important factor; packets cannot be sent back-to-back
- End-host effects: NIC setup time on the PCI bus and context switches
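As a rough model (my notation, not taken from the slides): let w be the requested inter-packet wait time, t_min the minimum time the end host needs to launch one packet, and L the bytes sent per packet including headers. The achievable throughput is then approximately

    \[
      B(w) \;\approx\; \frac{8L}{\max(w,\; t_{\min})}
    \]

which is flat for w below t_min (host- or link-limited) and falls as 1/w once the requested spacing dominates, giving the 1/t shape described above.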

udpmon: Achievable Throughput & Packet Loss
- ConnectX-4 NIC; IRQs moved away from core 6; affinity set to lock udpmon to core 6 on node 1; interrupt coalescence on (16 µs)
- Reaches 50 Gbit/s, yet the jumbo-size packets should give the highest throughput!
- Sending CPU: 96% in kernel mode, swapping between user and kernel mode
- Receiving CPU: % idle?? 37% in kernel mode on receive
- Packets are also lost in the receiving host
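A sketch of the pinning described on this slide, assuming the Mellanox set_irq_affinity_cpulist.sh helper shipped with the OFED driver (it is also used in the appendix slides) and leaving the udpmon options unspecified:

    # Stop the IRQ balancer so the affinity settings stick
    systemctl stop irqbalance.service
    # Steer the NIC IRQs to cores 8-11 on node 1, i.e. away from core 6
    /usr/sbin/set_irq_affinity_cpulist.sh 8-11 enp131s0f0
    # Lock the UDP sender to core 6 with memory allocated on node 1
    numactl --physcpubind=6 --membind=1 ./udpmon_send <udpmon options>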

udpmon_send: How Fast Can I Transmit? Sending Rate as a Function of Packet Size
- IRQs moved away from core 6; affinity set to lock udpmon to core 6 on node 1
- A step occurs at 7813 to 7814 bytes of user data: the per-packet send time steps by 0.29 µs (a drop of 3.6 Gbit/s) and then by 0.75 µs (a drop of 14.5 Gbit/s from 43 Gbit/s)
- Slope about 50 ps per byte, i.e. 8 bytes (64 bits) in 0.4 ns
- Installed and tested CentOS 7 (kernel 3.10.0-327.el7.x86_64) and CentOS 6.7 (kernel 2.6.32-573.el6.x86_64) at Boston Labs: OK
- Collaborating with Mellanox: the problem is not the driver but the kernel

udpmon: Size of the Rx Ring Buffer
- ConnectX-5; affinity of udpmon set to core 6 on node 1
- Use ethtool -S enp131s0f0 and look at the rx_out_of_buffer counter
- Throughput about 51 Gbit/s; with an Rx ring of 1024 there is up to 8% packet loss, with an Rx ring ≥ 4096 no packet loss
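The ring-buffer check and change use standard ethtool calls; a sketch with the interface name from this slide:

    # Current and maximum ring sizes
    ethtool -g enp131s0f0
    # Receive drops caused by ring exhaustion (the counter named on this slide)
    ethtool -S enp131s0f0 | grep rx_out_of_buffer
    # Enlarge the receive ring; 8192 is the value used in the appendix slides
    ethtool -G enp131s0f0 rx 8192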

iperf3: TCP Throughput and Use of Cores
- Firewalls off, TCP offload on, TCP cubic stack
- Throughput rises smoothly to the plateau, except for the peak
- Throughput at 0.7 MByte:
  - 80 Gbit/s with both send & receive on node 1
  - 35 Gbit/s with send on node 0 and receive on node 1
  - 28 Gbit/s with both send & receive on node 0
- Very few TCP re-transmitted segments observed
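One way to reproduce this kind of core/node placement test, as a sketch using numactl for the pinning (iperf3's own -A affinity option is an alternative; host names and durations are placeholders):

    # Receiver, pinned to core 6 on node 1
    numactl --physcpubind=6 --membind=1 iperf3 -s
    # Sender, pinned to core 6 on node 1; 30 s run, report every second
    numactl --physcpubind=6 --membind=1 iperf3 -c <receiver> -t 30 -i 1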

TCP Throughput: iperf2 & iperf3
- ConnectX-5, NIC Rx ring 4096, iperf pinned core 6 to core 6
- While transmitting at 80 Gbit/s the CPU was 98% in kernel mode
- (Plots: iperf3 and iperf2 results)
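For reference, the two tools are invoked in much the same way; a relevant design difference is that iperf2 is multi-threaded while iperf3 (at least the versions of this era) runs its streams in a single thread, which is why separate instances are used for the multi-flow tests later. A sketch, with the receiver host as a placeholder:

    # iperf2
    iperf -s                          # receiver
    iperf -c <receiver> -t 30 -i 1    # sender
    # iperf3
    iperf3 -s                         # receiver
    iperf3 -c <receiver> -t 30 -i 1   # sender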

TCP Throughput, iperf2: Effect of the Rx Ring Buffer
- ConnectX-5, iperf pinned core 6 to core 6
- Correlation of low throughput and re-transmits for an Rx ring of 1024
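To see the correlation yourself, sender-side retransmit counters and the receiver-side NIC drop counter can be sampled during a run; a sketch, using the same interface name as before:

    # Sender: TCP retransmissions seen by the stack
    netstat -s | grep -i retrans
    ss -ti          # per-connection detail, including retransmit counts
    # Receiver: NIC-level drops due to Rx ring exhaustion
    ethtool -S enp131s0f0 | grep rx_out_of_buffer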

iperf: TCP Throughput, Multiple Flows
- Distribute the IRQs on node 1; run iperf on cores 6-11 for both send & receive
- Firewalls on, TCP offload on, TCP cubic
- Total throughput increases from 60 to 90 Gbit/s going from 1 to 2 flows; >95 Gbit/s for 3 & 4 flows
- Re-transmission rates: about 10⁻⁴ % with 3 flows, 10⁻³ to 10⁻⁵ % with 4 flows, 10⁻² % with 8 and 10 flows
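A sketch of how several independent flows can be pinned one-per-core across cores 6-11; the port numbering is hypothetical, as the slides do not show the exact commands:

    # Receiver: one iperf3 server per core and port (cores 6-11, ports 5201-5206)
    for i in 0 1 2 3 4 5; do
        numactl --physcpubind=$((6 + i)) --membind=1 iperf3 -s -p $((5201 + i)) &
    done
    # Sender: one pinned client per flow towards the matching port
    for i in 0 1 2 3 4 5; do
        numactl --physcpubind=$((6 + i)) --membind=1 iperf3 -c <receiver> -p $((5201 + i)) -t 30 &
    done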

Inside an RDMA Application (send side)
- The application design places Work Requests on the Send Queue or Receive Queue
- It must check the Completion Queue; a queue cannot be over-filled
- It deals with low-level memory allocation, analogous to NIC "ring buffers"
- (Diagram after Robert D. Russell: user / channel adapter / wire; allocate and register virtual memory, post_send() on the send queue, data packets carried over RoCE v2 (UDP/IP), ACK, poll_cq() on the completion queue for status)

Inside an RDMA Application (receive side)
- Fill the Receive Queue with Work Requests and ask for notification
- Wait for an event (packet arrival), then poll for status
- Loop on the Receive Queue: use the data, issue a new Receive Work Request
- (Transport: RoCE v2 over UDP/IP)

RDMA RC
- Maximum packet size 4096 bytes; every message is acknowledged
- Pinned core 6 to core 6; the CPU was 90% in user mode
- The application design takes care of the ring buffers and polls the Completion Queue every 64 packets
- (Transport: RoCE v2 over UDP/IP)

RDMA RC: Throughput as a Function of Message Size
- Reasonably smooth increase with message size
- ~80 Gbit/s for jumbo-frame-sized messages; >90 Gbit/s for 30 kByte messages
- Not clear why small messages take so long
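Results of this kind can also be cross-checked with the standard perftest tools shipped with the Mellanox/OFED stack; a sketch, not the measurement code behind these slides, and the device name mlx5_0 is an assumption:

    # Server (receiver) side: RC transport, sweep all message sizes, report in Gbit/s
    ib_send_bw -d mlx5_0 -c RC -a --report_gbits
    # Client (sender) side
    ib_send_bw -d mlx5_0 -c RC -a --report_gbits <server-host>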

libvma User-Mode Kernel Bypass: UDP Throughput as a Function of Message Size
- libvma was modified to fragment correctly; the fix was fed back to Mellanox and into the distribution
- udpmon run unmodified, pinned core 6 to core 6, firewalls on
- Smooth increase; UDP > 90 Gbit/s for jumbo packets
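libvma is normally applied to an unmodified sockets application by preloading the library; a sketch, with the library path and udpmon options depending on the installation:

    # Run the unmodified UDP test under the VMA kernel-bypass library, pinned as before
    LD_PRELOAD=libvma.so numactl --physcpubind=6 --membind=1 ./udpmon_send <udpmon options>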

libvma UDP Performance
- Pinned core 6 to core 6
- Throughput > 98 Gbit/s, no packet loss, excellent jitter

libvma TCP Performance
- Pinned core 6 to core 6
- TCP iperf with the kernel stack: 57.7 Gbit/s; TCP iperf with libvma: 13.6 Gbit/s
- libvma does not seem to trap the iperf3 system calls

Questions?
Thanks.

Progress & Status
- Task 4.1: evaluation of existing data transfer protocols, storage sub-systems and applications
- Test hosts set up at Onsala, JBO, GÉANT Cambridge & London, Jülich
- Low-level protocol, end-host measurements: UDP, TCP, RDMA RoCEv2, kernel bypass with libvma
- Good performance with careful tuning: single flows 60-95 Gbit/s
- Technical Note started
- Tests on local sites with different hardware and OS versions

iperf: TCP Throughput, Multiple Flows (further measurements)
- Distribute the IRQs over all cores on node 1; run iperf on cores 6-11 for both send & receive
- Firewalls on, TCP offload on, TCP cubic
- Total throughput increases from 60 to 86 Gbit/s going from 2 to 3 flows; 98 Gbit/s for 4 & 5 flows, then it starts to fall
- Re-transmission rates: 10⁻⁴ to 10⁻⁵ % with 2 or 3 flows, 10⁻⁴ % with 4 or 5 flows, 10⁻³ % with 8 or 10 flows
- Individual flows can vary by ±5 Gbit/s

Network Tuning for 100 Gigabit Ethernet (1)
- Hyper-threading: turn off in the BIOS
- Wait states: disable / minimise the use of C-states, in the BIOS or at boot time
- Power saving / core frequency: set the governor to "performance" and set cpufreq to the maximum. This depends on the scaling_driver: acpi-cpufreq allows setting cpuinfo_cur_freq to the maximum; intel_pstate does not, but seems fast anyway.
  Read the current settings:
    $ cat /sys/devices/system/cpu/cpu*/cpufreq/*
    $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  Set (tee writes to every core's scaling_governor, which a single shell redirection cannot):
    # echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- NUMA: check and select CPU cores in the node with the Ethernet interface attached
    $ numactl -H
    $ cat /sys/devices/system/node/node*/cpulist
    $ lspci -tv
    $ cat /sys/class/net/*/device/uevent

Network Tuning for 100 Gigabit Ethernet (2)
- IRQs: turn off the irqbalance service, which prevents the balancer from changing the affinity scheme. Set the affinity of the NIC IRQs to CPU cores on the node with the PCIe slot, one IRQ per CPU core. For UDP it seems best NOT to use the CPU cores used by the applications.
    # systemctl stop irqbalance.service
    # cat /proc/irq/<irq>/smp_affinity
    # echo 400 > /proc/irq/183/smp_affinity
    # /usr/sbin/set_irq_affinity_cpulist.sh 8-11 enp131s0f0
- Interface parameters: ensure interrupt coalescence is on (3 µs, 16 µs, more?); ensure Rx & Tx checksum offload is on; ensure TCP segmentation offload is on; set the Tx and Rx ring buffer sizes.
    # ethtool -C <i/f> rx-usecs 8
    # ethtool -K <i/f> rx on tx on
    # ethtool -K <i/f> tso on
    # ethtool -G <i/f> rx 8192
    # ethtool -G <i/f> tx 8192
- MTU: set IP MTU to 9000 bytes; best set in the interface config files, e.g. ifcfg-ethX with MTU=9000
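The slide recommends making these settings persistent via the interface config files; an alternative sketch is a small script applied at boot (for example from rc.local or a one-shot systemd unit), with the interface name again an assumption:

    #!/bin/bash
    # Apply the per-interface tuning from the slides above at boot time
    IF=enp131s0f0
    ip link set dev $IF mtu 9000
    ip link set dev $IF txqueuelen 10000
    ethtool -C $IF rx-usecs 8
    ethtool -K $IF rx on tx on tso on
    ethtool -G $IF rx 8192 tx 8192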

Network Tuning for 100 Gigabit Ethernet (3)
- Queues: set txqueuelen, the transmit queue (I used 1000 but 10,000 is recommended); set netdev_max_backlog, the queue between the interface and the IP stack, to say 250000
- Kernel parameters (best set in /etc/sysctl.conf):
    net.core.rmem_max, net.core.wmem_max
    net.ipv4.tcp_rmem, net.ipv4.tcp_wmem (min / default / max)
    net.ipv4.tcp_mtu_probing (jumbo frames)
    net.ipv4.tcp_congestion_control (htcp, cubic)
    net.ipv4.tcp_mem (set the max to cover the rmem/wmem max)
- Interface parameters reduce the CPU load
- Better to choose fewer, higher-speed cores
- References: Mellanox, Performance Tuning Guide for Mellanox Network Adapters, http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf and ESnet FasterData, https://fasterdata.es.net/network-tuning/
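For illustration, a /etc/sysctl.conf fragment in the spirit of the ESnet FasterData guidance referenced above; the values are examples, not the settings used for these measurements. Apply with sysctl -p.

    # Large socket buffer limits (illustrative values)
    net.core.rmem_max = 536870912
    net.core.wmem_max = 536870912
    net.ipv4.tcp_rmem = 4096 87380 268435456
    net.ipv4.tcp_wmem = 4096 65536 268435456
    # Queue between the interface and the IP stack
    net.core.netdev_max_backlog = 250000
    # Path-MTU probing helps when jumbo frames meet a smaller-MTU path
    net.ipv4.tcp_mtu_probing = 1
    # Congestion control: htcp or cubic
    net.ipv4.tcp_congestion_control = htcp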