IEEE HPSR 2014
Scaling Multi-Core Network Processors Without the Reordering Bottleneck
Alex Shpiner (Technion / Mellanox), Isaac Keslassy (Technion), Rami Cohen (IBM Research)
Network Processors (NPs)
NPs are used in routers for almost everything:
- Forwarding
- Classification
- Deep Packet Inspection (DPI)
- Firewalling
- Traffic engineering
- VPN encryption
- LZS decompression
- Advanced QoS
- ...
Increasingly heterogeneous processing demands.
Parallel Multi-Core NP Architecture
- Each packet is assigned to a Processing Element (PE).
- Any per-packet load-balancing scheme can be used.
- E.g., Cavium CN68XX NP, EZChip NP-4.
Packet Ordering in NP
- NPs are required to avoid out-of-order packet transmission within a flow (TCP throughput, cross-packet DPI, statistics, etc.).
- The naïve solution avoids reordering altogether, so heavy packets often delay light packets.
- Can we reduce this reordering delay?
The Problem
Reducing reordering delay in parallel network processors.
Multi-Core Processing Alternatives
- Static (hashed) mapping of flows to processing elements (PEs) [Cao et al., 2000], [Shi et al., 2005]: potential for insufficient utilization of the PEs.
- Feedback-based adaptation of the static mapping [Kencl et al., 2002], [He et al., 2010], [We et al., 2011]: causes packet reordering.
- Pipelining without parallelism [Weng et al., 2004]: not scalable, due to heterogeneous requirements and command granularity.
Single SN (Sequence Number) Approach
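The single-SN baseline can be sketched as follows: every arriving packet receives one global sequence number, and the ordering unit releases packets strictly in SN order, so one heavy packet stalls every later packet, even from unrelated flows. This is a minimal illustrative sketch with our own naming, not the authors' implementation.

```python
import heapq

class SingleSNOrderingUnit:
    """Single global sequence number: strict total order on all packets.
    Names and structure are illustrative, not from the talk."""

    def __init__(self):
        self.next_sn = 0   # SN handed to the next arriving packet
        self.expected = 0  # SN the output is currently waiting for
        self.done = []     # min-heap of (sn, packet) that finished processing

    def on_arrival(self, packet):
        # Assign a global SN that the packet carries through processing.
        sn = self.next_sn
        self.next_sn += 1
        return sn

    def on_processing_done(self, sn, packet):
        # Buffer the finished packet; release the in-order prefix.
        heapq.heappush(self.done, (sn, packet))
        released = []
        while self.done and self.done[0][0] == self.expected:
            released.append(heapq.heappop(self.done)[1])
            self.expected += 1
        return released  # packets now safe to transmit, in order
```

For example, if packets A, B, C arrive in that order and B finishes first, B is held in the ordering unit until the heavier A completes, illustrating the head-of-line reordering delay this slide points at.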
Per-Flow Sequencing (Ideal)
- Actually, order needs to be preserved only within a flow [Khotimsky et al., 2002], [Wu et al., 2005], [Shi et al., 2007], [Cheng et al., 2008].
- One SN (sequence number) generator per flow.
- Ideal approach: minimal reordering delay.
- Not scalable to a large number of flows [Meitinger et al., 2008].
Hashed SN (Sequence Number) Approach
Note: the flow is hashed to an SN generator, not to a PE.
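The hashed-SN compromise can be sketched like this: flows are hashed to one of a fixed number of SN generators, so a packet only waits for older packets in the same bucket, not for the whole stream. A simplified sketch under our own naming; the hash function and bucket count are illustrative assumptions.

```python
import heapq

class HashedSNOrdering:
    """Hashed-SN ordering: each flow hashes to one of num_buckets
    independent SN generators. Illustrative sketch, not the authors' code."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.next_sn = [0] * num_buckets           # per-bucket SN counters
        self.expected = [0] * num_buckets          # per-bucket release pointer
        self.done = [[] for _ in range(num_buckets)]  # per-bucket (sn, pkt) heaps

    def on_arrival(self, flow_id):
        # The flow (not the packet) picks the bucket, preserving flow order.
        b = hash(flow_id) % self.num_buckets
        sn = self.next_sn[b]
        self.next_sn[b] += 1
        return b, sn

    def on_processing_done(self, bucket, sn, packet):
        # Release the in-order prefix of this bucket only.
        heapq.heappush(self.done[bucket], (sn, packet))
        released = []
        while self.done[bucket] and self.done[bucket][0][0] == self.expected[bucket]:
            released.append(heapq.heappop(self.done[bucket])[1])
            self.expected[bucket] += 1
        return released
```

A heavy packet in one bucket no longer delays packets in other buckets; the residual reordering delay comes from unrelated flows that collide in the same bucket, which is exactly what the proposal on the next slide attacks.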
Our Proposal
- Leverage an estimate of the packet processing delay.
- Instead of the arbitrary ordering domains created by a hash function, create ordering domains of packets with similar processing-delay requirements: a heavy-processing packet does not delay a light-processing packet in the ordering unit.
- Assumption: all packets within a given flow have similar processing requirements.
- Reminder: order must be preserved only within a flow.
Processing Phases
E.g.: IP forwarding = 1 phase; encryption = 10 phases.
[Figure: packet processing code divided into processing phases #1 through #5.]
Disclaimer: this is not real packet-processing code.
RP3 (Reordering Per Processing Phase) Algorithm
- All packets in an ordering domain have the same number of processing phases (up to K).
- Lower similarity of the processing delays affects the performance (reordering delay), but not the order!
Knowledge Frameworks
At what stage are the packet processing requirements known?
1. Known upon packet arrival.
2. Known only at the processing start.
3. Known only at the processing completion.
RP3 Algorithm for Framework 3
- Assumption: the packet processing requirements are known only when the processing completes.
- Example: a packet that finished all its processing after 1 phase is not delayed by another packet currently in its 2nd phase, because that means they belong to different flows.
- Theorem: an ideal partition into phases minimizes the reordering delay, which shrinks with the number of phases.
RP3 Algorithm for Framework 3
But, in reality:
RP3 Algorithm for Framework 3
- Each packet goes through several SN generators.
- After completing the φ-th processing phase, it requests the next SN from the (φ+1)-th SN generator.
RP3 Algorithm for Framework 3
- When a packet requests a new SN, it is not always granted immediately.
- The φ-th SN generator grants a new SN to the oldest packet that has finished processing φ phases.
- There is no processing preemption!
RP3 – Framework 3
(1) A packet arrives and is assigned SN 1.
(2) At the end of processing phase φ, it sends a request for SN φ+1; when granted, its SN advances.
(3) SN generator φ: grant the token when SN == oldestSN_φ; increment oldestSN_φ and NextSN_φ.
(4) PE: when all processing phases are finished, send the packet to the ordering unit (OU).
(5) OU: complete the SN grants.
(6) OU: when all SNs are granted, transmit the packet to the output.
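The steps above can be sketched as a small simulator: a packet gets an SN from generator 1 on arrival, and after finishing phase φ it requests an SN from generator φ+1, which grants SNs strictly in order of the phase-φ SNs (the oldest-first rule). This is a simplified, single-threaded sketch under our own naming, not the authors' reference implementation; the OU completion steps (5)–(6) are omitted.

```python
import heapq

class RP3OrderingSketch:
    """Simplified RP3 framework-3 sketch: one SN generator per phase level;
    grants at each level go to the oldest waiting packet first."""

    def __init__(self, num_phases):
        size = num_phases + 2                     # generators indexed 1..K+1
        self.next_sn = [0] * size                 # SN each generator hands out next
        self.oldest = [0] * size                  # oldest outstanding SN per generator
        self.waiting = [[] for _ in range(size)]  # per-generator (sn, pkt) min-heaps

    def on_arrival(self, pkt):
        # Generator 1 assigns an SN unconditionally on arrival.
        sn = self.next_sn[1]
        self.next_sn[1] += 1
        return sn

    def finished_phase(self, phi, sn_phi, pkt):
        # pkt finished phase phi holding SN sn_phi; it requests an SN from
        # generator phi+1. Grants follow sn_phi order, so no packet overtakes
        # an older one that completed at least as many phases.
        heapq.heappush(self.waiting[phi], (sn_phi, pkt))
        grants = []
        while self.waiting[phi] and self.waiting[phi][0][0] == self.oldest[phi]:
            _, p = heapq.heappop(self.waiting[phi])
            self.oldest[phi] += 1
            grants.append((p, self.next_sn[phi + 1]))
            self.next_sn[phi + 1] += 1
        return grants
```

For instance, if packet B finishes phase 1 before the older packet A, B's SN request waits; once A finishes, both are granted phase-2 SNs in arrival order, while a packet that has already reached phase 2 advances independently.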
Simulations: Reordering Delay vs. Processing Variability
- Synthetic traffic: Poisson arrivals; processing requirements uniformly distributed over [1, 10] phases; for a fair comparison, 10 hash buckets in the Hashed-SN algorithm; Zipf distribution of the packets over 300 flows.
- Phase processing delay variability: delay ~ U[min, max], variability = max/min, E[delay] = 100 time units.
[Plot: mean reordering delay vs. phase processing delay variability.]
- Under ideal conditions: no reordering delay.
- Improvement by orders of magnitude, also with high phase processing delay variability.
Simulations: Reordering Delay vs. Load
- Real-life trace: CAIDA anonymized Internet traces.
[Plot: mean reordering delay vs. load (%).]
- Improvement by orders of magnitude.
- Note: reordering delay occurs even under low load.
Summary
- Novel reordering algorithms for parallel multi-core network processors reduce reordering delays.
- They rely on the fact that all packets of a given flow require similar processing functions.
- Three frameworks define the stages at which the network processor learns the packet processing requirements.
- Analysis using simulations: reordering delays are negligible, both under synthetic traffic and real-life traces.
- Analytical model (in the paper).
Thank you.