16th IEEE NPSS Real Time Conference 2009, IHEP, Beijing, China, 12th May 2009
High Rate Packet Transmission on 10 Gbit/s Ethernet LAN Using Commodity Hardware
Domenico Galli, Università di Bologna and INFN, Sezione di Bologna
Commodity Links
- More and more often used in HEP for DAQ, event building and High Level Trigger systems: limited costs, maintainability, upgradability.
- The demand for data throughput in HEP keeps increasing, following: the physics event rate; the number of electronic channels; the reduction of the number of on-line event filter (trigger) stages.
- Industry has moved on since the design of the DAQ for the LHC experiments: 10 Gigabit Ethernet is well established; 100 Gigabit Ethernet is being actively worked on.
10-GbE Network I/O
- "Fast network, slow host" scenario: already seen in the transition to 1 Gigabit Ethernet (presentation at IEEE-NPSS RT 2005, Stockholm).
- Three major system bottlenecks may limit the efficiency of high-performance I/O adapters:
  - the PCI-X bus bandwidth (8.5 Gbit/s peak throughput in the 133 MHz flavor), superseded by PCI-E (20 Gbit/s peak throughput in the x8 flavor);
  - the memory bandwidth: the FSB clock has increased from 533 MHz to 1600 MHz;
  - the CPU utilization: mitigated by multi-core architectures.
10-GbE Point-to-Point Tests
- In real operating conditions, the maximum transfer rate is limited not only by the capacity of the link itself, but also:
  - by the capacity of the data busses (PCI and FSB);
  - by the ability of the CPUs and of the OS to handle, in due time, the packet processing and interrupt rates raised by the network interface cards.
- Data throughput and CPU load measured with the NIC mounted on the PCI-E bus of commodity PCs acting as transmitters and receivers, connected by a 10GBase-SR link.
Test Platform
Motherboard: IBM X3650
Processor type: Intel Xeon E5335
Processors x cores x clock (GHz): 2 x 4 x 2.00
L2 cache (MiB): 8
L2 speed (GHz): 2.00
FSB speed (MHz): 1333
Chipset: Intel 5000P
RAM: 4 GiB
NIC: Myricom 10G-PCIE-8A-S
NIC DMA speed, ro / wo / rw (Gbit/s): 10.44 / /
Settings
net.core.rmem_max (B):
net.core.wmem_max (B):
net.ipv4.tcp_rmem (B): 4096 / /
net.ipv4.tcp_wmem (B): 4096 / /
net.core.netdev_max_backlog:
Interrupt coalescence (μs): 25
PCI-E speed (Gbit/s): 2.5
PCI-E width: x8
Write combining: enabled
Interrupt type: MSI
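The slides list the kernel tuning parameters but not how the benchmark applications use them. A minimal sketch (buffer sizes and the UDP test socket are illustrative, not taken from the slides) of requesting enlarged per-socket buffers, which the kernel caps at net.core.rmem_max / net.core.wmem_max:

```c
/* Minimal sketch: request large socket buffers for a test socket.
 * The 4 MiB values are illustrative; the kernel silently caps them
 * at net.core.rmem_max / net.core.wmem_max. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);   /* UDP test socket */
    int sndbuf = 4 * 1024 * 1024;
    int rcvbuf = 4 * 1024 * 1024;
    socklen_t len = sizeof(int);

    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

    /* Read back the values actually granted by the kernel. */
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    printf("granted SO_SNDBUF=%d SO_RCVBUF=%d\n", sndbuf, rcvbuf);
    return 0;
}
```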
CPU Affinity Settings
(Diagram: pinning of the benchmark processes and of the NIC IRQ/softIRQ handling to specific CPU cores.)
CPU Affinity Settings (II)
(Diagram, continued.)
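The actual core assignments are in the slides' diagrams. As a hedged illustration of the mechanism, a sketch of how a process is pinned to one core with sched_setaffinity() and a NIC interrupt steered to another by writing /proc/irq/<N>/smp_affinity (the core numbers and the IRQ number below are assumptions):

```c
/* Sketch of the kind of affinity pinning used in the tests. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static void pin_process_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling process */
}

static void pin_irq_to_core(int irq, int core)
{
    char path[64];
    FILE *f;
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");                      /* requires root privileges */
    if (f) {
        fprintf(f, "%x\n", 1 << core);         /* hexadecimal CPU mask */
        fclose(f);
    }
}

int main(void)
{
    pin_process_to_core(2);   /* e.g. the sender process on CPU core 2 */
    pin_irq_to_core(90, 1);   /* IRQ number 90 is purely illustrative */
    return 0;
}
```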
UDP – Standard Frames
- 1500 B MTU (Maximum Transmission Unit).
- UDP datagrams sent as fast as possible (see the sketch below).
- Bottleneck: sender CPU core 2 (sender process at 100% system load).
(Plots: throughput and packet rate vs. datagram size, up to ~4.8 Gb/s and ~440 kHz, with steps at the 2-, 3- and 4-frame boundaries; per-core CPU load split into user, system, IRQ and softIRQ components, with the sender core at 100%.)
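A minimal sketch of the kind of UDP flooder used in these tests (receiver address, port and datagram size are illustrative, not taken from the slides):

```c
/* Send fixed-size UDP datagrams to the receiver as fast as possible. */
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst;
    char payload[1472];                    /* fills one 1500 B Ethernet frame */

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);            /* assumed test port */
    inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr);  /* assumed receiver */

    memset(payload, 0, sizeof(payload));
    for (;;)                               /* send as fast as possible */
        sendto(sock, payload, sizeof(payload), 0,
               (struct sockaddr *)&dst, sizeof(dst));
    return 0;
}
```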
UDP – Jumbo Frames
- 9000 B MTU.
- Considerable improvement with respect to the 1500 B MTU.
(Plots: throughput and packet rate up to ~9.7 Gb/s and ~440 kHz, with steps at the 2-, 3- and 4-frame and at the 2- and 3-PCI-E-frame boundaries; per-core CPU load split into user, system, IRQ and softIRQ components, with the sender core at 100%, still the bottleneck.)
UDP – Jumbo Frames (II)
- An additional dummy process, bound to the same core as the transmitter process (CPU 2), wastes CPU resources; the CPU share left to the transmitter process is trimmed by adjusting the relative priority (see the sketch below).
- The perfect linearity of throughput vs. available CPU confirms that the sender-side system CPU load was indeed the main bottleneck.
- Moving from a 2 GHz to a 3 GHz CPU (same architecture) therefore gives a potential 50% increase in maximum throughput, provided that bottlenecks of other kinds do not show up before such an increase is reached.
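A hypothetical sketch of the dummy CPU burner described above: pinned to the same core as the transmitter (CPU 2 is assumed) and spinning, so that the CPU share left to the sender is controlled by the relative nice values of the two processes (the nice value below is illustrative):

```c
/* Dummy process that wastes CPU cycles on a chosen core. */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    volatile unsigned long spin = 0;

    CPU_ZERO(&set);
    CPU_SET(2, &set);                 /* same core as the tx process (assumed) */
    sched_setaffinity(0, sizeof(set), &set);

    nice(10);                         /* illustrative relative priority */

    for (;;)                          /* burn CPU */
        spin++;
    return 0;
}
```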
UDP – Jumbo Frames, 2 Sender Processes
- Doubled availability of CPU cycles on the sender PC.
- 10-GbE link fully saturated.
- Receiver (playing against 2 senders) not yet saturated.
(Plots: throughput and packet rate up to ~10 Gb/s and ~600 kHz, with steps at the 2-, 3- and 4-frame boundaries; above a few KiB of send size the sender CPUs, at up to 200% combined, are no longer the bottleneck; receiver softIRQ 25-75%, system 75-90%.)
TCP – Standard Frames
- 1500 B MTU (Maximum Transmission Unit).
- TCP segments sent as fast as possible.
- Bottleneck: sender CPU core 2 (sender process at 100% system load).
(Plots: throughput up to ~5.8 Gb/s; per-core CPU load split into user, system, IRQ and softIRQ components; receiver softIRQ below 35%, system below 40%.)
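A minimal sketch of the TCP baseline sender (address, port and send size are illustrative): connect to the receiver and call send() in a tight loop.

```c
/* Stream data to the receiver over TCP as fast as possible. */
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst;
    char buf[8192];                        /* illustrative send size */

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);            /* assumed test port */
    inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr);  /* assumed receiver */

    if (connect(sock, (struct sockaddr *)&dst, sizeof(dst)) != 0)
        return 1;

    memset(buf, 0, sizeof(buf));
    for (;;)                               /* send as fast as possible */
        send(sock, buf, sizeof(buf), 0);
    return 0;
}
```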
TCP – Jumbo Frames
- 9000 B MTU.
- Improvement with respect to the 1500 B MTU (from roughly 6 to roughly 7 Gb/s).
- Bottleneck: sender CPU core 2 (sender process at 100% system load).
(Plots: throughput up to ~7 Gb/s; receiver softIRQ below 15%, system below 45%.)
TCP – Jumbo Frames, TCP_NODELAY
- Nagle's algorithm disabled: segments are always sent as soon as possible, even if only a small amount of data is available.
- Small data packets are no longer concatenated, so the frame-boundary discontinuities of the UDP tests reappear.
- UDP throughput is not reached, due to the additional overhead of the TCP protocol.
(Plots: throughput up to ~7.2 Gb/s, with steps at the 2-, 3- and 4-frame boundaries; sender CPU at 100%, bottleneck.)
TCP – Jumbo Frames, TCP_CORK
- Nagle's algorithm disabled.
- The OS does not send out partial frames at all until the application provides more data, even if the other end acknowledges all the outstanding data.
- No relevant differences with respect to the previous tests.
(Plots: throughput up to ~7 Gb/s; sender CPU at 100%, bottleneck; receiver softIRQ below 15%, system below 45%.)
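Both behaviors are per-socket options on Linux. The slides do not show how they were set; a sketch of the standard way with setsockopt(), assuming an already connected TCP socket `sock`:

```c
/* Enable/disable TCP_NODELAY and TCP_CORK on a connected TCP socket. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void set_nodelay(int sock, int on)
{
    /* on=1 disables Nagle's algorithm; on=0 re-enables it. */
    setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

static void set_cork(int sock, int on)
{
    /* While corked (on=1) the kernel does not send partial frames;
     * uncorking (on=0) flushes whatever is pending. */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}
```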
TCP – Standard Frames, Zero Copy
- sendfile() system call: transmits data read from a file descriptor to the network through a TCP socket.
- Avoids the copy from user space to kernel space.
- 1500 B MTU.
- Significant increase in throughput with respect to send().
(Plots: throughput up to ~8.2 Gb/s; above ~5 KiB of send size the sender CPU is no longer a 100% bottleneck; per-core CPU load split into user, system, IRQ and softIRQ components.)
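A sketch of the zero-copy transmission path with sendfile(), which pushes data directly from the page cache to the socket without the user-space copy done by read()+send() (file path, chunk size and the connected socket `sock` are assumptions):

```c
/* Send a file's content over a connected TCP socket with sendfile(). */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static void send_file_zero_copy(int sock, const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    off_t offset = 0;

    if (fd < 0 || fstat(fd, &st) != 0)
        return;

    while (offset < st.st_size) {
        /* The kernel advances `offset` by the amount actually sent. */
        ssize_t n = sendfile(sock, fd, &offset, 64 * 1024);
        if (n <= 0)
            break;
    }
    close(fd);
}
```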
TCP – Jumbo Frames, Zero Copy
- sendfile() system call.
- Improvement with respect to send() even more significant.
- For send sizes > 2.5 KiB: throughput reaches 10 Gbit/s and the load on sender CPU 2 drops below 100%, down to ~30%.
- Only test able to saturate the 10-GbE link with a single process.
(Plots: throughput up to ~10 Gb/s; above ~2.5 KiB of send size the CPU is no longer the bottleneck.)
TCP – Jumbo Frames – Zero Copy, No TSO
- TSO (TCP Segmentation Offload) switched off.
- TSO offloads to the NIC the segmentation into MTU-sized packets: the CPU can hand over up to 64 KiB of data in a single transmit request.
- No difference in throughput: the 10-GbE link is already saturated (send size > 2.5 KiB).
- CPU load is reduced by TSO (send size > 2.5 KiB).
(Plots: throughput up to ~10 Gb/s; above ~2.5 KiB of send size the CPU is no longer the bottleneck.)
TCP – Jumbo Frames – Zero Copy, No LRO
- LRO (Large Receive Offload) switched off.
- LRO aggregates TCP segments at the NIC level into larger packets, reducing the number of physical packets processed by the kernel.
- Performance (send sizes > 2.5 KiB) slightly worse.
- Total load of receiver CPU 1 considerably increased, up to 100% (becoming the bottleneck).
(Plots: throughput up to ~10 Gb/s; receiver CPU at 100%, bottleneck.)
TCP – Jumbo Frames – Zero Copy, No Scatter-Gather
- Scatter-gather I/O switched off.
- With scatter-gather I/O the packet does not need to be assembled into a single linear chunk by the kernel: the NIC retrieves through DMA the fragments stored in non-contiguous memory locations.
- Performance is significantly improved by SG I/O (throughput drops to ~7.4 Gb/s without it).
(Plots: throughput up to ~7.4 Gb/s; sender CPU at 100%, bottleneck.)
TCP – Jumbo Frames – Zero Copy, No Checksum Offload
- Checksum offload switched off.
- The calculation of the TCP checksum can be offloaded to the network card both at the sender and at the receiver side.
- Performance is significantly improved by checksum offload (throughput drops to ~6.9 Gb/s without it).
- However, when checksum offload is switched off, all the other offload functionalities of the NIC are also switched off (SG, TSO, etc.).
(Plots: throughput up to ~6.9 Gb/s; sender CPU at 100%, bottleneck.)
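The slides do not say how the offload features were toggled; on Linux this is normally done with the ethtool utility, which issues SIOCETHTOOL ioctls. A hedged sketch of the equivalent programmatic calls (the interface name is purely an assumption):

```c
/* Toggle NIC offload features via the legacy ethtool ioctl interface. */
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int set_offload(const char *ifname, __u32 cmd, __u32 enable)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket will do */
    struct ethtool_value eval = { .cmd = cmd, .data = enable };
    struct ifreq ifr;
    int ret;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&eval;
    ret = ioctl(fd, SIOCETHTOOL, &ifr);        /* requires root privileges */
    close(fd);
    return ret;
}

int main(void)
{
    set_offload("eth2", ETHTOOL_STSO, 0);      /* TSO off                    */
    set_offload("eth2", ETHTOOL_SSG, 0);       /* scatter-gather off         */
    set_offload("eth2", ETHTOOL_STXCSUM, 0);   /* TX checksum offload off    */
    /* LRO is toggled through a different ethtool command (driver flags),
     * omitted here for brevity. */
    return 0;
}
```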
Fiber Channel to 10-GbE Test
- Maximum throughput achievable by simultaneously reading data from 2 Fiber Channel links and sending it over a 10-GbE link, closely emulating the behavior of a SAN disk-server exporting data by means of a network file system (NFS, GPFS, Lustre).
- Node A exports 2 x 20 GiB disk images as real SCSI devices (using SCST) through FC.
- Node B aggregates the 2 SCSI devices in a RAID-0 (64 KiB stripe).
- File size: 5 GiB, on an XFS file system; serial read.
- The file is cached at A but cannot be cached at B.
(Diagram: nodes A, B and C, with 6 GiB and 4 GiB of RAM indicated.)
CPU Affinity Settings in the FC to 10-GbE Test
(Diagram: CPU core affinity assignments for this test.)
FC-Read to 10-GbE-Send Test
- Read-ahead set to 8 MiB (default 128 KiB).
- Data read and sent at maximum speed from the 5 GiB file, cyclically; sendfile() used to transmit (see the sketch below).
- Bottleneck: 10-GbE sender process (system load), but with up to 50% I/O wait.
- The FC protocol is lighter than 10-GbE.
(Plots: throughput up to ~6 Gb/s, for "read + send" vs. "read only"; per-core CPU load including I/O wait, with the softIRQ load of the two FC HBAs and the 10-GbE IRQ+softIRQ load shown separately.)
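A minimal sketch of the cyclic read-and-send loop described above (file path, chunk size and the connected socket `sock` are assumptions; the 8 MiB read-ahead on the slide is a block-device setting and is not reproduced here):

```c
/* Cyclically push a large file to a connected TCP socket with sendfile(). */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static void send_file_cyclically(int sock, const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    off_t offset = 0;

    if (fd < 0 || fstat(fd, &st) != 0)
        return;

    /* Hint sequential access to the kernel (only a hint, not the 8 MiB
     * read-ahead setting mentioned on the slide). */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    for (;;) {
        ssize_t n = sendfile(sock, fd, &offset, 1024 * 1024);
        if (n < 0)
            break;
        if (offset >= st.st_size)
            offset = 0;                  /* rewind: read the file cyclically */
    }
    close(fd);
}
```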
Conclusion
- We performed a complete series of tests, transferring data over 10-GbE links at full speed, using either the UDP or the TCP protocol, and varying the MTU and the packet send size.
- Main bottleneck: CPU utilization at the sender side (system load of the transmitter process).
- Jumbo frames are in practice mandatory for 10-GbE.
- In TCP transmission:
  - a significant improvement can be obtained with zero-copy transmission (sendfile());
  - the scatter-gather functionality considerably improves the performance;
  - the TSO functionality helps the sender CPU;
  - the LRO functionality helps the receiver CPU.
- 10-GbE fully supports transmission fed by 2 x 4 Gbit/s FC reads, without any apparent interference between the two phases of getting data from the SAN and sending it over Ethernet.