LCG TCP performance optimization for 10 Gb/s LHCOPN connections 1
On behalf of M. Bencivenni, T. Ferrari, D. De Girolamo, Stefano Zani (INFN CNAF), Andreas Hirstius (CERN)
HEPiX Spring Meeting, Roma, Apr

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 2 Objectives
–Test the 10 Gb/s path CERN – CNAF in preparation for the LHC Service Challenges
–Identify the list of software and hardware parameters to be tuned in order to optimize the single-flow achievable throughput
–Compare different TCP stacks (Reno and BIC) and Linux kernel versions
–Compare different 10 Gigabit Ethernet NICs

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 3 Performance Metrics and Parameters
Performance metrics:
–Achievable throughput (at the application level, memory-to-memory transfers, TCP and UDP)
–Achievable throughput convergence time
–CPU utilization
Parameters:
–TCP application parameters: send/receive socket buffer size, application read/write block size
–PCI-X bus: Max Memory Byte Read Count
–Linux kernel version: 2.4 (EL.cernsmp) vs 2.6 (#2 SMP)
–Queuing: txqueue and backlog queue sizes

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 4 Testbed: WAN Network Layout
[Diagram: network path under test, 10 Gb/s; source: E. Martelli, LHCOPN meeting, Jan 2006]

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 5 Testbed: Servers
CERN (1 server):
–4 processors, GenuineIntel, IA-64, Itanium 2
–NIC: 10 Gigabit Ethernet, s2io
–LAN: connection to a Force10 CE switch
CNAF (2 servers):
–2 processors, AMD Opteron 252 (Athlon)
–NIC: Intel PRO/10GbE
–LAN: connection to a BlackDiamond switch (10 GbE), directly connected to the GARR PoP in Bologna
Overall max point-to-point performance measured:
–6.8 Gb/s CNAF → CERN, 5 TCP streams, TCP BIC, MTU = 9000 B

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 6 Tuning at the application level: application read/write block size and send/receive socket buffer size
Large application read/write block sizes give:
–Higher CPU utilization and reduction of idle time
–Increase of achievable throughput, up to the point where all CPU cycles available on the tx server are used
Send/receive socket buffer:
–Min size needed (according to our tests): 20 MB
[Chart: achievable throughput (Gb/s) vs. application read/write block size (B)]
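
As a rough illustration of this application-level tuning (not taken from the slides), the sketch below shows how a sender might request a ~20 MB socket buffer and write large blocks; the buffer size, block size, destination address and port are assumptions, and the effective buffer is still capped by the kernel limits (net.core.wmem_max / net.ipv4.tcp_wmem).

```c
/* Minimal sender sketch: large socket buffer + large write blocks.
 * Assumptions (not from the slides): 20 MB SO_SNDBUF, 1 MB write blocks,
 * receiver at 192.0.2.10:5001 (placeholder address/port).
 * The kernel still clamps SO_SNDBUF to net.core.wmem_max. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    int sndbuf = 20 * 1024 * 1024;              /* ~20 MB send socket buffer */
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
        perror("SO_SNDBUF");

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5001);                 /* iperf-like placeholder port */
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);

    if (connect(sock, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect"); return 1;
    }

    size_t blk = 1024 * 1024;                   /* large application write block */
    char *buf = calloc(1, blk);
    for (int i = 0; i < 1000; i++) {            /* push roughly 1 GB of test data */
        ssize_t n = write(sock, buf, blk);
        if (n < 0) { perror("write"); break; }
    }
    free(buf);
    close(sock);
    return 0;
}
```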

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 7 PCI-X Bus Configuration: Max Memory Byte Read Count (mmbrc)
mmbrc defines the max number of bytes that can be exchanged on the PCI-X bus during a single transaction from/to an I/O device to/from the server RAM
Range: 512, 1024, 2048, 4096 B
A large mmbrc grants:
–A better data/transmission overhead ratio on the PCI-X bus
–More efficient CPU utilization
[Chart: achievable throughput (Gb/s) vs. application read/write block size (B), 1 TCP stream, TCP Reno]

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 8 TCP BIC (Binary Increase Congestion control)
Default protocol stack in the latest Linux kernel 2.6.x versions
Slow start and logarithmic increase of the congestion window (cwnd), plus additive increase
Binary search algorithm when increasing/decreasing cwnd (min/target/max):
–In case of loss:
 max' = cwnd
 min' = cwnd / 2
 target = (max' + min') / 2
–In case of growth (until max – min < S_min):
 max' = max
 min' = cwnd
 target = (max' + min') / 2
–If target – cwnd > S_max:
 cwnd' = cwnd + S_max (additive increase)
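
To make the binary-search rules above concrete, here is a small illustrative sketch of the window update described on this slide. It is not the kernel implementation: the struct, function names and the S_min/S_max values are hypothetical, and the max-probing phase after the search converges is omitted.

```c
/* Illustrative sketch of BIC's binary-search window update (per the slide),
 * expressed in segments. Names and constants are hypothetical, not kernel code. */
#define S_MIN 1.0    /* assumed minimum increment per RTT (segments) */
#define S_MAX 32.0   /* assumed maximum increment per RTT (segments) */

struct bic_state {
    double cwnd;     /* current congestion window */
    double wmin;     /* lower bound of the binary search */
    double wmax;     /* upper bound of the binary search */
};

/* On packet loss: remember the old window as the upper bound,
 * halve the window, and let the next target be the midpoint. */
static void bic_on_loss(struct bic_state *s)
{
    s->wmax = s->cwnd;
    s->wmin = s->cwnd / 2.0;
    s->cwnd = s->wmin;
}

/* Per-RTT growth: move toward target = (wmin + wmax) / 2, but never by
 * more than S_max (additive increase) nor by less than S_min. */
static void bic_on_rtt(struct bic_state *s)
{
    if (s->wmax - s->wmin < S_MIN) {
        /* Search has converged; real BIC then probes for a new maximum
         * (max probing), which is not shown in this sketch. */
        return;
    }
    double target = (s->wmin + s->wmax) / 2.0;
    double step = target - s->cwnd;
    if (step > S_MAX)
        step = S_MAX;              /* additive increase while far from the target */
    else if (step < S_MIN)
        step = S_MIN;              /* never grow by less than S_min */
    s->cwnd += step;
    if (s->cwnd > s->wmin)
        s->wmin = s->cwnd;         /* binary search: the surviving window becomes the lower bound */
}
```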

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 9 TCP Reno vs TCP BIC over WAN
TCP Reno → default in kernel 2.4
TCP BIC → default in kernel 2.6.x (latest kernels also use different process scheduling algorithms)
TCP BIC's logarithmic increase gives:
–Faster convergence to equilibrium
–TCP fairness: bounded for all window sizes
–RTT fairness: for large windows, the throughput distribution between streams with different RTTs is similar to that of AIMD algorithms
[Chart: cwnd convergence time with different TCP stacks; achievable throughput (Gb/s) vs. time (s)]

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 10 Best performance measured over WAN
After tuning of the parameters mentioned before:
–5 streams, TCP Reno → 6.2 Gb/s
–5 streams, TCP BIC → 6.8 Gb/s
TCP BIC provides better performance on high-bandwidth, long-distance network paths
Kernel 2.6 offers more efficient process scheduling algorithms → better performance with multiple streams, while with a single stream TCP Reno can perform better
Kernel 2.6: more CPU demanding
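
The slides do not show how the TCP stack was selected. As a hedged aside, on Linux kernels with pluggable congestion control (2.6.13 and later) the algorithm can be chosen per socket as in the sketch below; on the 2.4/early-2.6 kernels used in these tests the algorithm was instead determined by the kernel version and its TCP sysctls.

```c
/* Sketch: select the congestion control algorithm per socket.
 * Requires a kernel with pluggable congestion control (2.6.13+);
 * on older kernels Reno or BIC was simply the built-in default. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int open_tcp_socket_with_cc(const char *algo)   /* e.g. "bic" or "reno" */
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) { perror("socket"); return -1; }
    if (setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION,
                   algo, strlen(algo)) < 0)
        perror("TCP_CONGESTION");   /* algorithm not available or not loaded */
    return sock;
}
```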

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 11 Txqueue and backlog queue sizes
Tuning of the size of the backlog (rx) and transmission (tx) queues in the kernel is necessary in order to:
–avoid that packet descriptors are lost, due to a high tx and/or rx rate, because of no available space in the txqueue (before reaching the tx NIC) or in the backlog queue (i.e. at the very last stage of the end-to-end transmission process, on the receive path)
The minimum queue size depends on the NIC speed and on the MTU in use:
–jumbo frames (9000 B), 10 Gigabit Ethernet → 1000 packets
–1500 B, 1 Gigabit Ethernet → … packets
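
The slides do not include the corresponding commands. As a hedged sketch, the two queues discussed here are typically resized through the interface's txqueuelen and the net.core.netdev_max_backlog sysctl; the interface name "eth2" and the value 1000 below are assumptions taken from the 10 GbE / jumbo-frame case above.

```c
/* Sketch: enlarge the tx queue and the per-CPU backlog queue.
 * "eth2" and 1000 are placeholders (10 GbE, 9000 B MTU case).
 * Roughly equivalent to "ifconfig eth2 txqueuelen 1000" and
 * "sysctl -w net.core.netdev_max_backlog=1000"; requires root. */
#include <linux/sockios.h>   /* SIOCSIFTXQLEN */
#include <net/if.h>          /* struct ifreq, ifr_qlen */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* 1. Transmit queue length of the interface (txqueue). */
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth2", IFNAMSIZ - 1);
    ifr.ifr_qlen = 1000;                     /* packets */
    if (ioctl(sock, SIOCSIFTXQLEN, &ifr) < 0)
        perror("SIOCSIFTXQLEN");
    close(sock);

    /* 2. Backlog queue size (per CPU, receive path). */
    FILE *f = fopen("/proc/sys/net/core/netdev_max_backlog", "w");
    if (f) {
        fprintf(f, "1000\n");
        fclose(f);
    } else {
        perror("netdev_max_backlog");
    }
    return 0;
}
```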

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 12 Queuing in the Linux kernel: backlog queue during reception
[Diagram: the three stages of reception, from the per-CPU backlog queue to the receive socket buffer in RAM; source: [1]]

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 13 Queuing in the Linux kernel: txqueue during transmission
[Diagram: the three stages of transmission, from the send socket buffer through the txqueue to the DMA engine and NIC; source: [1]]

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 14 Interrupt coalescence
Most modern NICs provide interrupt moderation (interrupt coalescing) mechanisms to reduce the number of IRQs generated when receiving packets. In this case, an interrupt is generated only after a number of packets has been transmitted/received, or after a timeout from the last IRQ has expired, whichever comes first. This relieves the CPU from the IRQ storms generated under high traffic load, improving the forwarding rate.
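
Coalescing parameters are usually exposed through the driver's ethtool interface. The sketch below (not from the slides) reads and then adjusts the rx coalescing settings via the SIOCETHTOOL ioctl; the interface name and the values are placeholders, and which fields are honoured depends on the NIC driver.

```c
/* Sketch: tune rx interrupt coalescing via ethtool's ioctl interface.
 * "eth2", 100 us and 64 frames are placeholder values; driver support varies.
 * Roughly equivalent to "ethtool -C eth2 rx-usecs 100 rx-frames 64"; requires root. */
#include <linux/ethtool.h>
#include <linux/sockios.h>   /* SIOCETHTOOL */
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct ethtool_coalesce coal;
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth2", IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&coal;

    /* Read the current coalescing settings... */
    memset(&coal, 0, sizeof(coal));
    coal.cmd = ETHTOOL_GCOALESCE;
    if (ioctl(sock, SIOCETHTOOL, &ifr) < 0) { perror("ETHTOOL_GCOALESCE"); return 1; }

    /* ...then raise the IRQ moderation: fire an interrupt after 100 us
     * or 64 received frames, whichever comes first. */
    coal.cmd = ETHTOOL_SCOALESCE;
    coal.rx_coalesce_usecs = 100;
    coal.rx_max_coalesced_frames = 64;
    if (ioctl(sock, SIOCETHTOOL, &ifr) < 0)
        perror("ETHTOOL_SCOALESCE");

    close(sock);
    return 0;
}
```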

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 15 New API (NAPI)
Receive livelock can be avoided by disabling IRQ generation on the NICs and letting the operating system decide when to poll the NIC hardware status register to determine whether new packets have been received. The NIC polling frequency is determined by the operating system and, as a consequence, a polling-driven stack may increase the packet forwarding latency under light traffic load.
The network softIRQ is modified so as to run poll on all interfaces on the polling list in a round-robin fashion to enforce fairness. No more than a budget B of packets can be extracted from the NIC reception rings in a single invocation of the network softIRQ, in order to limit the time the CPU spends processing packets. Whenever poll extracts less than a quota Q of packets from a NIC reception ring, it reverts that NIC to interrupt mode by removing it from the polling list and re-enabling IRQ notification.
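
Purely as an illustration of the budget/quota behaviour described above, here is a toy user-space simulation; it is not the kernel NAPI code, and the device structure, budget and quota values are made up.

```c
/* Self-contained toy simulation of NAPI's budget/quota logic (not kernel code).
 * Each "device" is just a counter of packets waiting in its rx ring. */
#include <stdio.h>

#define NDEV    3
#define BUDGET 300   /* max packets per softIRQ invocation (assumed value) */
#define QUOTA   64   /* max packets per device per poll (assumed value)    */

struct toy_dev {
    int ring;        /* packets waiting in the rx ring        */
    int polling;     /* 1 = on the polling list, 0 = IRQ mode */
};

/* One invocation of the (simulated) network softIRQ. */
static void net_rx_softirq(struct toy_dev dev[])
{
    int budget = BUDGET;
    int active = 1;
    while (budget > 0 && active) {
        active = 0;
        for (int i = 0; i < NDEV && budget > 0; i++) {   /* round-robin pass */
            if (!dev[i].polling)
                continue;
            active = 1;
            int quota = QUOTA < budget ? QUOTA : budget;
            int done  = dev[i].ring < quota ? dev[i].ring : quota;
            dev[i].ring -= done;                         /* "extract" packets */
            budget     -= done;
            if (done < quota) {
                /* Ring drained below the quota: revert to interrupt mode. */
                dev[i].polling = 0;
                printf("dev%d: back to IRQ mode\n", i);
            }
        }
    }
    /* If the budget runs out while devices remain on the polling list,
     * the softIRQ is rescheduled and polling resumes later. */
}

int main(void)
{
    struct toy_dev dev[NDEV] = { {500, 1}, {40, 1}, {200, 1} };
    net_rx_softirq(dev);
    for (int i = 0; i < NDEV; i++)
        printf("dev%d: %d packets left, %s\n", i, dev[i].ring,
               dev[i].polling ? "polling" : "IRQ mode");
    return 0;
}
```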

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 16 Comparison of the two approaches
[Charts: NAPI network stack (poll) vs. interrupt-driven network stack; source: [2]]

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 17 Additional 10 Gb results (1/2)
Memory-to-memory: S2IO/Neterion Xframe NIC results
–WAN measurements in collaboration with DataTAG
–CERN → CalTech
–Servers: CERN: Quad Itanium 2; CalTech: Dual Opteron; S2IO 10 Gb NICs
–Transfer rate: 7.2 Gb/s
Disk configuration:
–one CPU servicing the interrupts of 1 RAID controller with 8 disks (3 CPUs, 3 controllers, 24 disks)
–one CPU for the 10 Gb NIC interrupts
–HDs in JBOD mode, sw RAID0
–(with new ARECA controllers: hw RAID 5 + sw RAID 0 → same read performance and ~900 MB/s write)

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 18 Additional 10 Gb results (2/2)
Disk-to-memory transfers (CERN → CalTech, 2004):
–Servers:
 CERN: Quad Itanium 2 with 24x SATA disks and 3x 3Ware 9500 controllers (local disk I/O: 1.1 GB/s read, ~500 MB/s write)
 CalTech: Quad Opteron with 24x SATA disks and 3x SuperMicro controllers (local disk I/O: ~450 MB/s read, ~450 MB/s write)
–single-stream transfer rate: ~700 MB/s
–multi-stream transfer rate: ~690 MB/s
Disk-to-disk transfers:
–CERN → CalTech: ~375 MB/s (limited by the receiving side)
–CalTech → CERN: ~475 MB/s (limited by the receiving side)

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 19 Conclusions
Careful tuning at the application, kernel and PCI level is needed
Good understanding of the kernel queuing mechanisms is important
TCP BIC and TCP Reno in different kernel versions:
–Comparison is difficult at 10 Gb/s, as transmission is CPU-bound and kernels can differ significantly
–TCP BIC: better convergence and overall achievable throughput in the presence of multiple streams
NIC hardware architecture can considerably affect the max achievable throughput

LCG TCP performance optimization for 10 Gb/s LHCOPN connections 20 References
[1] DataTAG Project (FP5/IST), Technical Report: A Map of the Networking Code in the Linux Kernel
[2] A. Bianco et al., Open-Source PC-Based Software Routers: A Viable Approach to High-Performance Packet Switching