Status from Martin Josefsson: Hardware level performance of 82599
by Jesper Dangaard Brouer, Master of Computer Science, ComX Networks A/S
Testlab and data provided by Martin Josefsson, Procera Networks
Bifrost Workshop 2010, d.27/1-2010

2/11 10GbE NICs and chipset issues with 82599: Background
● Martin Josefsson (aka Gandalf)
  – could not attend Bifrost Workshop 2010
● Thus, I'm giving a summary of his findings
● Tried to determine
  – the hardware performance limits of the 82599 chip
  – found some chipset limitations
● Hardware shows (82599 + Nehalem)
  – very close to wirespeed forwarding capabilities
● But partial cacheline PCIe transfers drop performance

3/11 10GbE NICs and chipset issues with 82599: Test setup
● Dell R710:
● CPU: 1 x E5520
  – 4 cores @ 2.26GHz, QPI 5.86GT/s, 1066MHz DDR3
  – Hyperthreading enabled, improves performance by ~20%
● NIC: 2 x Intel X520-DA2 (2 ports)
  – single 82599ES, using PCIe x8 gen2
● Traffic generator:
  – Smartbits with two 10GE modules

4/11 10GbE NICs and chipset issues with 82599: Operating System: PLOS
● PLOS is Procera Networks' own OS
  – 1 thread runs Linux, 7 threads run PLOS
● Small standalone kernel
  – designed for packet processing only (no userspace)
● Uses a channel concept (see the sketch below)
  – each channel is assigned two interfaces
  – packets are forwarded between the interfaces
  – giving a 1:1 mapping of RX and TX interfaces
  – (details in bonus slides)
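PLOS itself is proprietary and not shown in the slides, so as a rough illustration only, here is a minimal C sketch of what such a channel abstraction could look like; every type and function name below (struct iface, rx_burst, tx_one, channel_run) is hypothetical, not the real PLOS API:

```c
#include <stddef.h>

/* Hypothetical sketch of a "channel": two paired interfaces where
 * everything received on one side is transmitted on the other.
 * PLOS is proprietary; none of these names are the real API. */
struct pkt   { unsigned char *data; size_t len; };
struct iface {
    int id;
    int  (*rx_burst)(struct iface *self, struct pkt **out, int max);
    void (*tx_one)(struct iface *self, struct pkt *p);
};
struct channel { struct iface *a, *b; };

static void run_direction(struct iface *src, struct iface *dst)
{
    struct pkt *batch[64];
    int n = src->rx_burst(src, batch, 64);  /* grab a bunch of packets */
    for (int i = 0; i < n; i++)
        dst->tx_one(dst, batch[i]);         /* forward without parsing */
}

/* One pass over the channel services both directions (RX a -> TX b,
 * RX b -> TX a), giving the 1:1 RX/TX interface mapping. */
static void channel_run(struct channel *ch)
{
    run_direction(ch->a, ch->b);
    run_direction(ch->b, ch->a);
}
```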

5/11 10GbE NICs and chipset issues with 82599: Dumb forwarding: HW level
● Test results/setup
  – very dumb forwarding of packets
  – purpose: find the hardware limits
  – the packet data is not even touched by the CPUs (see the sketch below)
● Tests are bidirectional
● Results count
  – forwarded TX packets per sec
  – from both interfaces
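"Not touching the data" typically means the CPU only shuffles DMA descriptors between rings while the payload stays in the NIC buffers. A hedged sketch of that idea, with a deliberately simplified descriptor layout (the real 82599 descriptor format is different and richer):

```c
#include <stdint.h>

#define RING_SIZE 512

/* Deliberately simplified descriptor and ring; illustrative only. */
struct desc { uint64_t buf_addr; uint16_t len; };
struct ring { struct desc d[RING_SIZE]; uint32_t head, tail; };

/* "Dumb" forwarding: only the buffer pointer and length move from
 * the RX ring to the TX ring; the CPU never reads the payload. */
static void forward_descriptors(struct ring *rx, struct ring *tx)
{
    while (rx->head != rx->tail) {
        struct desc *r = &rx->d[rx->head % RING_SIZE];
        struct desc *t = &tx->d[tx->tail % RING_SIZE];
        *t = *r;        /* reuse the same DMA buffer for TX */
        rx->head++;
        tx->tail++;
    }
    /* a real driver would now write the NIC's TX tail register and
     * refill the RX ring with fresh buffers */
}
```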

6/11 10GbE NICs and chipset issues with 82599: Test results
● Two tests
  – One NIC: using two ports on one dual-port NIC
  – Two NICs: using two NICs, one port on each

7/11 10GbE NICs and chipset issues with 82599: Test results: Graph
(graph not reproduced in the transcript)

8/11 10GbE NICs and chipset issues with 82599: Issue: Partial cacheline PCIe
● The cause of the performance hit:
● According to Intel there is an issue with partial cacheline PCIe transfers in the Tylersburg (X58, 5500, 5520) chipset
● Each partial cacheline transfer uses one outstanding PCIe transfer and one outstanding QPI transfer (worked arithmetic below)
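To make the "partial cacheline" point concrete, here is a small self-contained calculation (my illustration, not from the slides) of how a DMA write decomposes into full and partial 64-byte cacheline transfers:

```c
#include <stdio.h>

#define CACHELINE 64

int main(void)
{
    /* A DMA write that does not end on a 64-byte boundary produces a
     * trailing partial-line transfer, which on Tylersburg ties up one
     * outstanding PCIe transfer and one outstanding QPI transfer. */
    int sizes[] = { 64, 65, 128, 129 };
    for (int i = 0; i < 4; i++) {
        int len     = sizes[i];
        int full    = len / CACHELINE;
        int partial = (len % CACHELINE) ? 1 : 0;
        printf("%3d-byte write: %d full + %d partial cacheline(s)\n",
               len, full, partial);
    }
    return 0;   /* 65 bytes: 1 full + 1 partial -- the expensive case */
}
```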

9/11 10GbE NICs and chipset issues with 82599: Workaround?
● No solution at the moment to this issue :(
● A possible workaround
  – get the 82599 to pad PCIe transfers to a multiple of the cacheline size (round-up sketched below)
  – but unfortunately it doesn't have this capability
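The wished-for padding is just a round-up to the next cacheline multiple; a one-function sketch (remember, the 82599 cannot actually be told to do this):

```c
#include <stddef.h>

#define CACHELINE 64

/* Round a transfer length up to a multiple of the cacheline size;
 * this is the padding the slide wishes the 82599 could apply. */
static size_t pad_to_cacheline(size_t len)
{
    return (len + CACHELINE - 1) & ~(size_t)(CACHELINE - 1);
}
/* pad_to_cacheline(65) == 128: one extra full-line write instead of
 * a costly partial-line transfer. */
```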

10/11 10GbE NICs and chipset issues with 82599: Close to Wirespeed
● A single port on an 82599
  – can't RX more than around 13.8 Mpps at 64 bytes
● Intel verified this
  – Ethernet wirespeed is 14.8 Mpps at 64 bytes (derivation below)
● Which I find quite impressive
  – also compared to other 10GbE NICs
  – and compared to a peak of 5.7 Mpps by Linux
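The 14.8 Mpps figure follows directly from minimum-size Ethernet framing overhead; this small program reproduces it:

```c
#include <stdio.h>

int main(void)
{
    /* Each minimum-size frame occupies 64 bytes of frame plus 8 bytes
     * preamble/SFD plus 12 bytes inter-frame gap = 84 bytes on the wire. */
    const double link_bps  = 10e9;              /* 10 Gbit/s          */
    const double wire_bits = (64 + 8 + 12) * 8; /* 672 bits per frame */
    printf("10GbE wirespeed: %.2f Mpps\n", link_bps / wire_bits / 1e6);
    return 0;   /* prints 14.88 Mpps */
}
```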

11/11 10GbE NICs and chipset issues with 82599: Conclusion
● The Nehalem architecture
  – a massive improvement over the old Intel arch
  – still has issues that impact performance greatly
● The 82599 NIC
  – comes very close to wirespeed
● Partial cacheline PCIe hurts badly!
  – numbers are still impressive
● Compared to Linux
  – room for optimizations!

12/11 10GbE NICs and chipset issues with 82599: Bonus slides
● After this slide are some bonus slides
  – with extra details...

13/11 10GbE NICs and chipset issues with 82599: Bonus#1: locking of RX queue
● Each interface has
  – a single RX-queue and 7 TX-queues in these tests
● Each PLOS thread locks the RX-queue (sketched below)
  – grabs a bunch of packets
  – unlocks the RX-queue lock
  – forwards the packets
  – to one of the TX-queues of the destination interface
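A minimal sketch of this lock/grab/unlock/forward pattern, using a POSIX spinlock; the queue layout and driver functions are stubs, and all names are mine, not PLOS code:

```c
#include <pthread.h>
#include <stddef.h>

#define BATCH 32

struct pkt { unsigned char *data; size_t len; };
struct rxq { pthread_spinlock_t lock; /* ring state elided */ };

/* stubbed driver hooks, illustrative only */
static int  rxq_poll(struct rxq *q, struct pkt **out, int max)
{ (void)q; (void)out; (void)max; return 0; }
static void txq_send(int txq, struct pkt *p) { (void)txq; (void)p; }

static void service_rx(struct rxq *q, int my_txq)
{
    struct pkt *batch[BATCH];

    pthread_spin_lock(&q->lock);        /* lock assumed initialised   */
    int n = rxq_poll(q, batch, BATCH);  /* grab a bunch under the lock */
    pthread_spin_unlock(&q->lock);      /* release before forwarding   */

    for (int i = 0; i < n; i++)
        txq_send(my_txq, batch[i]);     /* this thread's own TX queue  */
}
```

Releasing the lock before TX keeps the critical section short, and with 7 TX-queues per interface each PLOS thread can own one TX queue outright, so the TX side needs no lock at all.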

14/11 10GbE NICs and chipset issues with 82599: Bonus#2: locking of RX queue
● RX is serialized by a spin-lock (or trylock)
  – threads continue to the next interface's RX-queue if the current one is already locked (trylock variant sketched below)
● This is clearly not optimal
  – RSS and DCA will be investigated in the future
● TX is not serialized
  – but there are cases where serializing TX actually helps
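The trylock variant skips a busy queue instead of spinning on it; a hedged sketch along the same lines (hypothetical names again):

```c
#include <pthread.h>

#define NIFACES 4

struct rxq { pthread_spinlock_t lock; /* ring state elided */ };
static struct rxq rx_queues[NIFACES]; /* assumed pthread_spin_init'ed */

static void service_one(struct rxq *q) { (void)q; /* drain as in Bonus#1 */ }

/* Walk all interfaces; if another thread already holds an RX-queue
 * lock, move on to the next one rather than waiting. */
static void poll_all(void)
{
    for (int i = 0; i < NIFACES; i++) {
        struct rxq *q = &rx_queues[i];
        if (pthread_spin_trylock(&q->lock) != 0)
            continue;                   /* busy: try the next queue */
        service_one(q);
        pthread_spin_unlock(&q->lock);
    }
}
```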

15/11 10GbE NICs and chipset issues with 82599: Strange PCIe x4 experience
● It seems we get higher performance
  – with non-cacheline-sized packets
  – when the 82599 is running at PCIe x4 gen2
● Channel with 2 ports on the same 82599ES:
  – PCIe x8 and 65-byte packets: 8.2 Mpps
  – PCIe x4 and 65-byte packets: 11.2 Mpps
● I can't explain this; it feels like it could have something to do with either PCIe latency or with how many outstanding PCIe transfers the BIOS has configured for the PCIe port