10GE network tests with UDP

10GE network tests with UDP
Janusz Szuba, European XFEL

Outline
- Overview of initial DAQ architecture
- Slice test hardware specification
- Initial networking test results
- DAQ software UDP tests
- Summary

Initial DAQ architecture
This talk focuses on the 10GE UDP link between the train builder (TB) and the PC layer.
- Multiple layers with well-defined APIs, to allow insertion of new technology layers
- Multiple slices for partitioning and scaling: camera sizes will increase, and slice replication will be one solution
- Data reduction and rejection enforced in all layers: early data size reduction and data rejection are needed to reduce storage resources

Slice test hardware setup
[Diagram: TB prototype connected over 10GE UDP to a 10 Gbps switch, which feeds the PC and compute nodes and the storage servers]

Hardware specification
The final slice width is 16 inputs from the TB to 16 PCL servers, covering a 1 Mpxl detector.
Specification of the first ½ of the slice (hardware, function, count, spec):
- PowerEdge R610 – PC layer (8) and metadata servers (2): 3 GHz, 2-socket, 6-core Intel X5675, 96 GB RAM, Intel X520-DA2 dual 10GbE NIC
- PowerEdge C6145 – computing cluster: 2.6 GHz, 4-socket, 16-core AMD 6282SE, 192 GB RAM, Intel X520-DA2 dual 10GbE NIC
- Cisco 5596UP – 10GbE switch (1): 48 ports (extendible to 96), wire speed, SFP+
- PowerEdge C410x – PCIe chassis: 8 HIC inputs, 16 x16 PCIe sleds
- NVIDIA M2075 – computing (4): PCIe x16 GPGPU
- IBM x3650 M4 – 9 TB storage servers, fast SAS (6): 2.6 GHz, 2-socket, 8-core Intel E5-2670 (Sandy Bridge), 128 GB RAM, Mellanox ConnectX-3 VPI dual NIC (PCIe Gen3, QSFP with adapter to SFP+), 14x 900 GB 10 krpm 6 Gbps SAS, RAID6
- Servers for extension chassis: 2.6 GHz, 2-socket, 8-core Intel E5-2670 (Sandy Bridge), 128 GB RAM, Mellanox ConnectX-3 VPI dual NIC (PCIe Gen3, QSFP with adapter to SFP+)
- IBM EXP2512 – 28 TB extension chassis: 3 TB 7.2 krpm 6 Gbps NL SAS, RAID6, connected through 2 mini-SAS connectors
- Demonstrator – TB board and ATCA crate

Initial network tests with system tools
Setup: two hosts (1+2), each with a dual 10 Gbps NIC, connected via the switch
Tools:
- netperf, e.g. netperf -H 192.168.142.55 -t tcp_stream -l 60 -T 3,4 -- -s 256K -S 256K
- mpstat for CPU and interrupt usage
Settings:
- Socket buffer size (128 kB - 4 MB)
- Adjusted kernel parameters:
  net.core.rmem_max = 16777216          # default 131071
  net.core.wmem_max = 16777216          # default 131071
  net.core.rmem_default = 10000000      # default 129024
  net.core.wmem_default = 10000000      # default 129024
  net.core.netdev_max_backlog = 300000  # default 1000
  txqueuelen = 10000                    # default 1000
- MTU 1500 and 9000 (jumbo frames)
- CPU affinity
Two modes: unidirectional stream (PCL node 1 → switch → PCL node 2) and bidirectional stream (PCL node 1 ↔ switch ↔ PCL node 2)
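As an illustration of why the rmem/wmem limits above matter, here is a minimal C sketch (not part of the original slides): it requests a large UDP receive buffer with SO_RCVBUF and reads back what the kernel actually granted, which is capped by net.core.rmem_max, so the sysctl changes have to be in place first. The 16 MB request is an example value.

/* Minimal sketch: request a large UDP receive buffer and check what the
 * kernel actually granted (the grant is capped by net.core.rmem_max). */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int requested = 16 * 1024 * 1024;   /* example: 16 MB, matching rmem_max above */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    int granted = 0;
    socklen_t len = sizeof(granted);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) == 0)
        /* Linux reports twice the requested value to account for bookkeeping overhead. */
        printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);
    return 0;
}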

TCP and UDP throughput tests
- Both TCP and UDP bandwidth at wire speed
- More tuning required to reduce UDP packet loss
- Further tests performed with the DAQ software developed for this purpose

Software architecture – overview
- Two applications are under development: a train builder emulator and a PCL node application
- Data-driven model (push): start the PC nodes and wait for data; start the UDP feeders and wait for the timer signal; start the timing system
- Communication model: 1-to-1, N-to-N
[Diagram: timing system driving UDP feeders 1..m in the train builder emulator, which send over UDP to nodes 1..n in the PC layer]
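The slides do not include code; the following minimal C sketch (my illustration, not the project's implementation) shows the receiving side of the push model described above: a PCL node binds a UDP socket and blocks until the feeders start pushing train data. The port number is a hypothetical placeholder.

/* Minimal sketch of a PCL-node UDP receiver in the push model:
 * bind, then block until the feeder starts sending. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define DATA_PORT 32000          /* hypothetical port */
#define MAX_DGRAM (64 * 1024)

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(DATA_PORT);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    static unsigned char buf[MAX_DGRAM];
    unsigned long packets = 0, bytes = 0;
    for (;;) {
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0) { perror("recvfrom"); break; }
        packets++;
        bytes += (unsigned long)n;
        /* ... hand the datagram to the train-assembly / processing stage ... */
    }
    close(fd);
    return 0;
}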

TB data format
Train structure (configurable): header, images & descriptors, detector data, trailer
Implementation:
- Plain vanilla UDP
- The train is divided into chunks of 8k (configurable)
Packet structure: each packet carries one data chunk plus per-packet metadata: frame number (4 B), SoF/EoF flags (1 B; SoF=1 on the first packet of a train, EoF=1 on the last) and packet number (3 B); packets within a train are numbered 000, 001, 002, 003, ...
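As an illustration only, a possible C encoding of the per-packet metadata described above is sketched below; the field order, byte order and exact placement within the packet are assumptions, not the project's definition.

/* Hypothetical sketch of the per-packet metadata: 4-byte frame number,
 * 1 byte of SoF/EoF flags, 3-byte packet number, packed into 8 bytes. */
#include <stdint.h>
#include <stdio.h>

#define SOF_FLAG 0x01   /* set on the first packet of a train */
#define EOF_FLAG 0x02   /* set on the last packet of a train  */

struct tb_packet_meta {
    uint32_t frame;      /* # Frame, 4 bytes */
    uint8_t  flags;      /* SoF / EoF, 1 byte */
    uint8_t  packet[3];  /* # Packet, 3 bytes (24-bit counter) */
} __attribute__((packed));

/* Encode the 24-bit packet counter (big-endian here by assumption). */
static void set_packet_number(struct tb_packet_meta *m, uint32_t n)
{
    m->packet[0] = (n >> 16) & 0xff;
    m->packet[1] = (n >> 8) & 0xff;
    m->packet[2] = n & 0xff;
}

static uint32_t get_packet_number(const struct tb_packet_meta *m)
{
    return ((uint32_t)m->packet[0] << 16) |
           ((uint32_t)m->packet[1] << 8)  |
            (uint32_t)m->packet[2];
}

int main(void)
{
    struct tb_packet_meta m = { .frame = 42, .flags = SOF_FLAG };
    set_packet_number(&m, 131029);
    printf("frame %u, packet %u, sizeof(meta) = %zu bytes\n",
           m.frame, get_packet_number(&m), sizeof(m));
    return 0;
}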

Tuning summary I
Driver:
- Newer driver with NAPI enabled: ixgbe-3.9.15
- InterruptThrottleRate=0,0 (disabled; default 16000,16000)
- LRO (Large Receive Offload) = 1 (default)
- DCA (Direct Cache Access) = 1 (default)
Linux kernel (socket buffers):
- net.core.rmem_max = 1342177280 (default 131071)
- net.core.wmem_max = 516777216 (default 131071)
- net.core.rmem_default = 10000000 (default 129024)
- net.core.wmem_default = 10000000 (default 129024)
- net.core.netdev_max_backlog = 416384 (default 1000)
- net.core.optmem_max = 524288 (default 20480)
- net.ipv4.udp_mem = 11416320 15221760 22832640 (default 262144 327680 393216)
- net.core.netdev_budget = 1024 (replaces netdev_max_backlog for the NAPI driver)

Tuning summary II
NIC:
- Jumbo frames: MTU = 9000
- Ring buffer RX & TX = 4096 (default 512)
- rx-usec = 1 (default 1)
- tx-frames-irq = 4096 (default 512)
PCI bus:
- MMBRC (maximum memory byte read count) = 4096 bytes (default 512)
Linux services:
- The iptables, irqbalance and cpuspeed services are stopped
- IRQ affinity: all interrupts are handled by core 0
- Hyper-threading: no visible effect (ON/OFF)

Test setup
Application settings:
- Core 0 is used exclusively for IRQ processing (packet steering thread, soft IRQ)
- Sender thread on core 7
- Receiver thread on core 2 (an even-numbered core other than 0); some processing is done on the received packets
- Separate thread to compile statistics and write the log file, running in parallel on core 5 (rate logged every 1 s)
- Run as root: necessary to set the scheduling policy and thread priority
[Diagram: thread placement with respect to the dual-port NIC and the two NUMA nodes: packet steering (soft IRQ), UDP receiver, UDP sender and stat/control threads]
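A minimal C sketch of what the thread placement above implies (my illustration, not the project's code): pin a thread to one core and give it a real-time scheduling policy, which is why the tests had to run as root. The core number and priority are example values.

/* Pin a thread to a core and switch it to SCHED_FIFO (requires root). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void pin_and_prioritize(pthread_t thread, int core, int priority)
{
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(core, &cpus);
    if (pthread_setaffinity_np(thread, sizeof(cpus), &cpus) != 0)
        fprintf(stderr, "failed to pin thread to core %d\n", core);

    struct sched_param sp = { .sched_priority = priority };
    if (pthread_setschedparam(thread, SCHED_FIFO, &sp) != 0)
        fprintf(stderr, "failed to set SCHED_FIFO (needs root)\n");
}

/* Example: pin the calling (receiver) thread to core 2 at priority 80. */
int main(void)
{
    pin_and_prioritize(pthread_self(), 2, 80);
    /* ... run the UDP receive loop here ... */
    return 0;
}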

Test results
[Diagram: sender-receiver pairing across the eight machines]
- Pairs of sender-receiver run in parallel on all machines
- UDP packet size 8192 bytes (configurable)
- Run length (number of packets / total time): 10^6 (12.21 s, 6.6 s without time profile); 10^7 to ~2.5x10^9 (several runs without time profile); 5x10^9 (16 h 37 m)
- Emulating a realistic timing profile from the TB: each node sends 1 train (~131030 packets), then waits for 0.7 s
Packet loss:
- A few tens to hundreds of packets might be lost
- When? Loss is observed only at the beginning of a run (warm-up phase?)
- Packet loss rate < 10^-9 (excluding the warm-up phase)
- Where? Varies between runs: all machines, a few machines, or no machine
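The packet-loss figures above imply per-packet accounting on the receiver. One possible approach, sketched here under the assumption that the 24-bit packet counter from the data-format slide is available for each received datagram, is to count gaps in the packet number within each train.

/* Sketch (assumption, not the project's code): count lost packets by looking
 * for gaps in the per-train packet counter. */
#include <stdint.h>
#include <stdio.h>

struct loss_stats {
    uint64_t received;
    uint64_t lost;
    uint32_t expected_next;   /* next packet number expected within the current train */
};

/* Call once per received packet with the decoded packet number;
 * sof marks the first packet of a train and resets the expectation. */
static void account_packet(struct loss_stats *s, uint32_t packet_number, int sof)
{
    if (sof)
        s->expected_next = 0;
    if (packet_number > s->expected_next)
        s->lost += packet_number - s->expected_next;   /* gap => lost packets */
    s->received++;
    s->expected_next = packet_number + 1;
}

int main(void)
{
    struct loss_stats s = {0};
    /* Simulate a 5-packet train in which packet 2 was dropped. */
    account_packet(&s, 0, 1);
    account_packet(&s, 1, 0);
    account_packet(&s, 3, 0);
    account_packet(&s, 4, 0);
    printf("received %llu, lost %llu\n",
           (unsigned long long)s.received, (unsigned long long)s.lost);
    return 0;
}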

CPU usage
[Plot: CPU usage (%) of the core running the UDP reader thread (core 2) over time, split into user, system and idle]

Summary
- Tuning is required to achieve full bandwidth on 10GE
- Additional kernel, driver, system and application tuning is required to reduce UDP packet loss