10Gbit/s Bi-Directional Routing on standard hardware running Linux, by Jesper Dangaard Brouer

Presentation transcript:

10Gbit/s Bi-Directional Routing on standard hardware running Linux by Jesper Dangaard Brouer, Master of Computer Science, Linux Kernel Developer, ComX Networks A/S. LinuxCon 2009.

2 /36 Linux: 10Gbit/s Bi-directional Routing Who am I ● Name: Jesper Dangaard Brouer – Edu: Computer Science, Uni. of Copenhagen ● Focus on Networks, Distributed systems and OS – Linux user since 1996, professional since 1998 ● Sysadm, Developer, Embedded – OpenSource projects ● Author of – ADSL-optimizer – CPAN IPTables::libiptc ● Patches accepted into – Kernel, iproute2, iptables and wireshark

3 /36 Linux: 10Gbit/s Bi-directional Routing Presentation overview ● When you leave this presentation, you will know: – The Linux Network stack scales with the number of CPUs – About PCI-express overhead and bandwidth – What hardware to choose – Whether it is possible to do 10Gbit/s bidirectional routing on Linux How many think it is possible to do 10Gbit/s bidirectional routing on Linux?

4 /36 Linux: 10Gbit/s Bi-directional Routing ComX Networks A/S ● I work for ComX Networks A/S – Danish Fiber Broadband Provider (TV, IPTV, VoIP, Internet) ● This talk is about – our experiments with 10GbE routing on Linux ● Our motivation: – Primarily budget/money (in these financial crisis times) ● Linux solution: a factor 10 cheaper! – (60K USD -> 6K USD) – Need to upgrade capacity in backbone edges – Personal: annoyed with a bug on Foundry gear and their tech support

5 /36 Linux: 10Gbit/s Bi-directional Routing Performance Target ● Usage: a "normal" Internet router – 2 port 10 Gbit/s router ● Bidirectional traffic ● Implying: – 40Gbit/s through the interfaces (5000 MB/s) – Internet packet size distribution ● No jumbo frame "cheats" ● Not only max MTU (1500 bytes) – Stability and robustness ● Must survive a DoS attack with small packets

6 /36 Linux: 10Gbit/s Bi-directional Routing Compete: 10Gbit/s routing level ● Enabling hardware factors – PCI-express: is a key enabling factor! ● Giving us enormous "backplane" capacity ● PCIe x16 gen.2 marketing number: 160 Gbit/s – one-way 80 Gbit/s → after encoding 64 Gbit/s → after protocol overhead ~54 Gbit/s – Scaling: NICs with multiple RX/TX queues in hardware ● Makes us scale beyond one CPU ● Large effort in the software network stack (thanks DaveM!) – Memory bandwidth ● Target 40Gbit/s → 5000 MBytes/s

7 /36 Linux: 10Gbit/s Bi-directional Routing PCI-express: Overhead ● Encoding overhead: 20% (8b/10b encoding) – PCIe gen.1: 2.5 Gbit/s per lane → 2 Gbit/s ● Generation 1 vs. gen.2: – Double bandwidth, 2 Gbit/s → 4 Gbit/s per lane ● Protocol overhead: packet based – Overhead per packet: 20 bytes (32-bit), 24 bytes (64-bit) – MaxPayload 128 bytes => 16% additional overhead ● PCIe x8 gen.2 = 32 Gbit/s ● MaxPayload 128 bytes => ~27 Gbit/s ● For details see: "Hardware Level IO Benchmarking of PCI Express"
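To make those overhead numbers concrete, here is a back-of-the-envelope calculation (assuming 64-bit addressing, i.e. 24 bytes of TLP overhead per packet, and the MaxPayload of 128 bytes mentioned above):

    per TLP on the wire:  128 bytes payload + 24 bytes overhead = 152 bytes
    efficiency:           128 / 152 ≈ 0.84   (the ~16% overhead above)
    PCIe x8 gen.2:        8 lanes × 4 Gbit/s = 32 Gbit/s after 8b/10b encoding
    effective:            32 Gbit/s × 0.84 ≈ 27 Gbit/s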

8 /36 Linux: 10Gbit/s Bi-directional Routing Hardware: Device Under Test (DUT) ● Starting with cheap gaming-style hardware – No budget, skeptics in the company – CPUs: Core i7 (920) vs. Phenom II X4 (940) – RAM: DDR3 vs DDR2 ● Several PCI-express generation 2.0 slots – Motherboards ● Core i7 – Chipset X58: Asus P6T6 WS Revolution ● AMD – Chipset 790GX: Gigabyte MA790GP-DS4H What can we expect from this hardware...

9 /36 Linux: 10Gbit/s Bi-directional Routing DUT: Memory Bandwidth ● Is the raw memory bandwidth enough? – Memory types: DDR2 (Phenom II) / DDR3 (Core i7) ● Target 5000 MBytes/s (40Gbit/s), i.e. 2500 MB/s writes and 2500 MB/s reads – The HW should be capable of doing several 10GbE ports
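A quick way to sanity-check the raw memory bandwidth of a candidate box is a streaming memory benchmark; a minimal sketch using the classic sysbench 0.4 command-line syntax (the block and total sizes are arbitrary example values, any streaming benchmark gives the same kind of number):

    # sequential memory writes, 1 MiB blocks
    sysbench --test=memory --memory-block-size=1M --memory-total-size=8G --memory-oper=write run
    # sequential memory reads, 1 MiB blocks
    sysbench --test=memory --memory-block-size=1M --memory-total-size=8G --memory-oper=read run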

10 /36 Linux: 10Gbit/s Bi-directional Routing 10Gbit/s Network Interface Cards ● Network Interface Cards (NICs) under test: – Sun Neptune (niu): 10GbE dual port NIC, PCIe x8 gen.1 (XFP) ● PCIe x8 gen.1 = 16 Gbit/s (with 16% overhead ≈ 13.4 Gbit/s) – SMC Networks (sfc): 10GbE NIC, Solarflare chip (XFP) ● hardware queues have issues, not enabled by default ● only one TX queue, thus cannot parallelize – Intel (ixgbe): newest Intel chip NIC ● Fastest 10GbE NIC I have ever seen!!! ● Engineering samples from: – Intel: dual port SFP+ based NIC (PCIe x8 gen.2) – Hotlava Systems Inc.: 6 port SFP+ based NIC (PCIe x16 gen.2)

11 /36 Linux: 10Gbit/s Bi-directional Routing Preliminary: Bulk throughput ● Establish: enough PCIe and memory bandwidth? – Target: 2 port 10GbE bidir routing ● collective 40Gbit/s – with packet size 1500 bytes (MTU) Answer: Yes, but only with specific hardware: CPU Core i7, Intel based NIC, DDR3 memory at minimum 1333 MHz, QuickPath Interconnect (QPI) tuned to 6.4 GT/s (default 4.8 GT/s)

12 /36 Linux: 10Gbit/s Bi-directional Routing Observations: AMD Phenom II ● AMD Phenom(tm) II X4 940 (AM2+) – Can do 10Gbit/s one-way routing – Cannot do bidirectional 10Gbit/s ● Memory bandwidth should be enough – Write 20Gbit/s and Read 20Gbit/s (2500 MB/s each) ● 1800 MHz HyperTransport seems too slow – HT 1800 MHz ~ bandwidth 57.6 Gbit/s – "Under-clocking" HT: performance followed the clock change – Theory: latency issue ● PCIe to memory latency too high; throughput limited by outstanding packets

13 /36 Linux: 10Gbit/s Bi-directional Routing Scaling with the number of CPUs ● To achieve these results – distribute the load across CPUs – A single CPU cannot handle 10GbE ● Enabling factor: "multiqueue" – NICs with multiple hardware RX/TX queues – Separate IRQ per queue (both RX and TX) ● Lots of IRQs used – look in /proc/interrupts, e.g. ethX-rx-2
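As a quick check that the NIC really exposes multiple queues, the per-queue interrupt lines can be listed directly (eth31 is just an example interface name from the test setup):

    # one IRQ line per hardware queue, e.g. eth31-rx-0 ... eth31-rx-7 and eth31-tx-0 ...
    grep eth31 /proc/interrupts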

14 /36 Linux: 10Gbit/s Bi-directional Routing Linux Multiqueue Networking ● RX path: the NIC computes a hash – Also known as RSS (Receive-Side Scaling) – Binds flows to a queue, avoiding out-of-order packets ● Large effort in the software network stack – TX qdisc API "hack", backward compatible – Beware: bandwidth shapers break CPU scaling! Linux Network stack scales with the number of CPUs

15 /36 Linux: 10Gbit/s Bi-directional Routing Practical: Assign HW queues to CPUs ● Each HW (RX or TX) queue has an individual IRQ – Look in /proc/interrupts – Naming scheme: ethXX-rx-0 ● Assign via the "smp_affinity" mask – In /proc/irq/nn/smp_affinity – Trick: /proc/irq/*/eth31-rx-0/../smp_affinity – Trick: grep . /proc/irq/*/eth31-*x-*/../smp_affinity ● Use tool: 'mpstat -A -P ALL' – see if the interrupts are equally shared across CPUs
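A minimal sketch of pinning queue IRQs to CPUs via smp_affinity (the interface name eth31, the IRQ numbers and the CPU masks are examples; look the real IRQ numbers up in /proc/interrupts, and stop irqbalance first so it does not overwrite the masks):

    # show the current affinity of all eth31 RX/TX queue IRQs
    grep . /proc/irq/*/eth31-*x-*/../smp_affinity
    # pin eth31-rx-0 (here assumed to be IRQ 60) to CPU0 (mask 0x1)
    echo 1 > /proc/irq/60/smp_affinity
    # pin eth31-rx-1 (here assumed to be IRQ 61) to CPU1 (mask 0x2), and so on
    echo 2 > /proc/irq/61/smp_affinity
    # verify that the interrupt and softirq load is spread across CPUs
    mpstat -A -P ALL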

16 /36 Linux: 10Gbit/s Bi-directional Routing Binding RX to TX: Stay on the same CPU ● RX to TX queue mapping: tied to the same CPU – Avoid/minimize cache misses, consider NUMA ● 3 use-cases for staying on the same CPU: – Forwarding (main focus) (RX to TX on the other NIC) ● How: record the queue number at RX and use it at TX ● Kernel support for proper RX to TX mapping – Server (RX to TX) ● Caching of socket info (credit to: Eric Dumazet) – Client (TX to RX) ● Hard; Flow "director" in the 10GbE Intel NIC

17 /36 Linux: 10Gbit/s Bi-directional Routing Start on results ● Let's start to look at the results... – Already revealed: ● Large frames: 10GbE bidirectional was possible! ● What about smaller frames? (you should ask...) – First we need to look at: ● Equipment ● Tuning ● NICs and wiring

18 /36 Linux: 10Gbit/s Bi-directional Routing Test setup: Equipment ● Router: Core i7 920 (default 2.66 GHz) – RAM: DDR3 1600 MHz (X.M.P settings) ● Patriot: DDR3 Viper Series Tri-Channel – Low Latency CAS – QPI at 6.4 GT/s (due to X.M.P) ● Generator#1: AMD Phenom 9950 quad ● Generator#2: AMD Phenom II X4 940

19 /36 Linux: 10Gbit/s Bi-directional Routing Test setup: Tuning ● Binding RX to TX queues ● Intel NIC tuning – Adjust the interrupt mitigation parameter rx-usecs to 512 ● Avoids interrupt storms (at small packet sizes) – 'ethtool -C eth31 rx-usecs 512' – Ethernet flow control (pause frames) ● Turned off during tests: – To see the effects of overloading the system – 'ethtool -A eth31 rx off tx off' ● Recommend turning it on for production usage
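For convenience, the tuning commands above collected as a small script sketch (eth31 is an example interface name; repeat for every 10GbE port under test):

    # interrupt mitigation: coalesce RX interrupts, avoids IRQ storms at small packet sizes
    ethtool -C eth31 rx-usecs 512
    # disable Ethernet flow control (pause frames) while overload-testing
    ethtool -A eth31 rx off tx off
    # for production usage, turn flow control back on:
    # ethtool -A eth31 rx on tx on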

20 /36 Linux: 10Gbit/s Bi-directional Routing NIC test setup #1 ● Router – Equipped with: Intel dual port NICs – Kernel: rc1 (net-next-2.6 tree, commit 8e321c4) ● Generators – Equipped with NICs connected to the router ● Sun Neptune (niu) ● SMC (10GPCIe-XFP) Solarflare (sfc) – Using: pktgen ● UDP packets ● Randomized dst port number – utilizes multiple RX queues
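A minimal pktgen sketch along these lines (the interface name eth1, destination IP/MAC and the port range are placeholder example values; the randomized UDP destination port is what spreads flows across the router's RX queues via RSS):

    #!/bin/sh
    modprobe pktgen
    pgset() { echo "$1" > "$PGDEV"; }

    # bind the generator device to kernel thread 0
    PGDEV=/proc/net/pktgen/kpktgend_0
    pgset "rem_device_all"
    pgset "add_device eth1"

    # configure the packet stream
    PGDEV=/proc/net/pktgen/eth1
    pgset "count 0"                        # run until stopped
    pgset "pkt_size 60"                    # 64-byte frames on the wire (60 + 4-byte CRC)
    pgset "dst 10.1.0.1"                   # routed destination address (placeholder)
    pgset "dst_mac 00:11:22:33:44:55"      # MAC of the router's ingress port (placeholder)
    pgset "flag UDPDST_RND"                # randomize the UDP destination port
    pgset "udp_dst_min 1"
    pgset "udp_dst_max 4096"

    # start transmitting (echo "stop" > /proc/net/pktgen/pgctrl to stop)
    PGDEV=/proc/net/pktgen/pgctrl
    pgset "start"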

21 /36 Linux: 10Gbit/s Bi-directional Routing Setup #1: 10GbE uni-directional ● Be skeptical – Packet generator ● Too slow! ● Sun NICs

22 /36 Linux: 10Gbit/s Bi-directional Routing Setup #1: Packets Per Sec ● Be skeptical – Packet generator ● Too slow! ● Sun NICs ● PPS limit

23 /36 Linux: 10Gbit/s Bi-directional Routing NIC test setup #2 ● Router and Generators – All equipped with: ● Intel based NICs ● Pktgen limits – Sun NICs max out at 2.5 Mpps – Intel NIC (at packet size 64 bytes) ● 8.4 Mpps with AMD CPU ● 11 Mpps with Core i7 CPU as generator

24 /36 Linux: 10Gbit/s Bi-directional Routing Setup #2: 10GbE uni-dir throughput ● Wirespeed 10GbE – uni-dir routing – down to pktsize 420 bytes

25 /36 Linux: 10Gbit/s Bi-directional Routing Setup #2: 10GbE uni-dir PPS ● Limits – Packets Per Sec – 3.5 Mpps ● at 64 bytes

26 /36 Linux: 10Gbit/s Bi-directional Routing 10GbE Bi-directional routing ● Wirespeed at – 1514 and 1408 bytes – 1280 almost ● Fast generators – the router can survive the load!

27 /36 Linux: 10Gbit/s Bi-directional Routing 10GbE Bi-dir: Packets Per Sec ● Limited – by PPS rate – Max at ● 3.8 Mpps

28 /36 Linux: 10Gbit/s Bi-directional Routing Simulate: Internet traffic pattern ● Target: Internet router – Simulate Internet traffic with pktgen ● Based on Robert Olsson's article: – "Open-source routing at 10Gb/s" – Packet size distribution – Large number of flows ● 8192 flows, duration 30 pkts, destinations from a /8 prefix (16M) ● watch out for: – slow-path route lookups/sec – size of the route cache
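A couple of commands for keeping an eye on exactly those two things while such a test runs (this sketch assumes a pre-3.6 kernel, where the IPv4 route cache still exists; lnstat reads the counters under /proc/net/stat):

    # per-interval route-cache statistics: entries, hits and slow-path lookups (in_slow_tot)
    lnstat -f rt_cache
    # rough count of the current route-cache entries
    ip route show cache | wc -l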

29 /36 Linux: 10Gbit/s Bi-directional Routing Uni-dir: Internet traffic pattern ● Simulated Internet traffic – 10GbE uni-directional ● Route cache is scaling – 900k entries in the route cache – very little performance impact – almost the same as constant-size packet forwarding

30 /36 Linux: 10Gbit/s Bi-directional Routing Bi-Dir: Internet traffic pattern ● 10GbE bi-directional – Strange: average packet size differs in this test ● Generators: 554 bytes vs. receive size: 423 bytes – Route cache seems to scale (1.2M entries) ● when comparing to the constant-size packet tests

31 /36 Linux: 10Gbit/s Bi-directional Routing Summary: Target goal reached? ● 2 port 10GbE "Internet" router – Uni-dir: very impressive, wirespeed with small pkts – Bi-dir: wirespeed for the 3 largest packet sizes ● Good curve, no choking ● Bandwidth: well within our expected traffic load – Internet type traffic ● Uni-dir: impressive ● Bi-dir: follows the pkt size graph, route cache scales – Traffic overloading / semi-DoS ● Nice graphs, doesn't choke on small pkts Yes! - It is possible to do 10Gbit/s bidirectional routing on Linux

32 /36 Linux: 10Gbit/s Bi-directional Routing Bandwidth for 4 x 10GbE bidir? ● Beyond 20GbE – Pktsize: 1514 ● Tx: 31.2 Gb/s ● Rx: 31.9 Gb/s – Enough mem bandwidth – > 2x 10GbE bidir – Real limit: PPS

33 /36 Linux: 10Gbit/s Bi-directional Routing 4 x 10GbE bidir: Packets Per Sec ● Real limit – PPS limits

34 /36 Linux: 10Gbit/s Bi-directional Routing Summary: Lessons learned ● Linux Network stack scales – the "multiqueue" framework works! ● Hardware – Choosing the right hardware is essential ● PCI-express latency is important – Choose the right chipset – Watch out for interconnected PCIe switches ● Packets Per Second – is the real limit, not bandwidth

35 /36 Linux: 10Gbit/s Bi-directional Routing Future ● Buy server grade hardware – CPU Core i7 → Xeon 55xx ● Need minimum Xeon 5550 (QPI 6.4 GT/s, RAM 1333 MHz) ● Two physical CPUs, NUMA challenges ● Smarter usage of HW queues – Assure QoS by assigning real-time traffic to a HW queue ● ComX example: IPTV multicast streaming ● Features affecting performance? ● See: "How Fast Is Linux Networking" – Stephen Hemminger, Vyatta ● Japan Linux Symposium, Tokyo

36 /36 Linux: 10Gbit/s Bi-directional Routing The End Thanks! Engineering samples: Intel Corporation, Hotlava Systems Inc., SMC Networks Inc. Famous Top Linux Kernel Committer: David S. Miller

37 /36 Linux: 10Gbit/s Bi-directional Routing 10G optics too expensive! ● 10GbE SFP+ and XFP optics are very expensive – There is a cheaper alternative! – Direct Attached Cables ● SFP+ LR optics price: 450 USD (need two) ● SFP+ copper cable price: 40 USD – Tested cable from: ● Methode dataMate

38 /36 Linux: 10Gbit/s Bi-directional Routing Pitfalls: Bad motherboard PCIe design ● Reason: could not get beyond 2x 10GbE ● Motherboard: Asus P6T6 WS Revolution – Two PCIe switches ● X58 plus NVIDIA's NF200 ● NF200 connected via PCIe gen.1 x16 – 32 Gbit/s -> with overhead ~27 Gbit/s – Avoid using the NF200 slots ● then scaling works again...
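A hedged way to spot this kind of slot layout on a given board is to dump the PCIe topology and check which switch each NIC hangs off (the bus address 07:00.0 below is only an example):

    # tree view of the PCI(e) topology: NICs behind an extra switch show up one bridge level deeper
    lspci -tv
    # negotiated link speed and width for a specific NIC
    lspci -vv -s 07:00.0 | grep -E 'LnkCap|LnkSta'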

39 /36 Linux: 10Gbit/s Bi-directional Routing Chipset X58: Core i7 ● Bloomfield, Nehalem ● Socket: LGA-1366

40 /36 Linux: 10Gbit/s Bi-directional Routing Chipset P55: Core i5 and i7-800 ● Lynnfield, Nehalem ● Socket: LGA-1156