A Closer Look at NFV Execution Models

1 A Closer Look at NFV Execution Models
APNet'19, August 2019, Beijing. Peng Zheng, Arvind Narayanan, Zhi-Li Zhang. Good afternoon everyone. My name is Peng, and it's a great honor to be here to present our work. I'm going to talk about NFV execution models. This is joint work with Arvind and Professor Zhi-Li Zhang at UMN.

2 Background NFV advocates running network functions (NFs) on commodity servers
Example NFs: Access Control (ACL), Network Monitor (NM), Load Balancer (LB), L3 Forwarder (L3FWD). The slide contrasts dedicated middleboxes with commodity multi-core servers.

3 Background NFV promises the benefits (flexibility & scalability) of software: rapid development, deployment, and evolution; dynamic scaling in and scaling out on demand. (Same ACL / NM / LB / L3FWD example, contrasting dedicated middleboxes with commodity multi-core servers.)

4 Challenges of NFV NFV promises the benefits (flexibility & scalability) of software, but it is challenging to achieve both: maximum 100Gbps line speed on commodity multi-core servers, and the scalability and flexibility afforded by software.

5 Challenges of NFV Why attaining line speed for NFV is challenging
Per-packet time budget for 64-byte frames, by Ethernet standard (year approved):
  10Gbps  (2002): 67.2ns
  40Gbps  (2010): 16.8ns
  100Gbps (2010): 6.7ns (~23 cycles at 3.4GHz)
  400Gbps (2017): 1.7ns
Let's have a look at why attaining line speed is challenging. Over the past 20 years, NIC throughput has moved from 10Gbps to 100Gbps and 400Gbps, and many companies in the industry, including Facebook and Google, have even expressed a need for Terabit Ethernet. To keep up with line-speed processing, the time budget for each packet has shrunk from nearly a hundred nanoseconds to a few nanoseconds. To achieve 100Gbps line speed, the server has only 6.7ns to process a 64-byte packet, which is about 23 cycles when the CPU is clocked at 3.4GHz.
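As a quick arithmetic check on these budgets (my own sketch, not part of the talk), the per-packet time at a given line rate follows from the 64-byte frame plus the 20 bytes of on-wire overhead (preamble/SFD and inter-frame gap), and the cycle budget follows from the 3.4GHz clock assumed above:

```c
#include <stdio.h>

int main(void) {
    /* A minimum-size Ethernet frame occupies 64 + 8 (preamble/SFD) + 12 (IFG)
     * = 84 bytes = 672 bits on the wire. */
    const double wire_bits = 84 * 8.0;
    const double cpu_ghz   = 3.4;                       /* clock rate used in the talk */
    const double rates_gbps[] = { 10, 40, 100, 400 };

    for (int i = 0; i < 4; i++) {
        double budget_ns = wire_bits / rates_gbps[i];   /* ns available per packet */
        double cycles    = budget_ns * cpu_ghz;         /* CPU cycles per packet   */
        printf("%3.0f Gbps: %5.1f ns per packet (~%.0f cycles at %.1f GHz)\n",
               rates_gbps[i], budget_ns, cycles, cpu_ghz);
    }
    return 0;
}
```

Running this reproduces the numbers on the slide: 67.2ns, 16.8ns, 6.7ns (~23 cycles), and 1.7ns per packet.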

6 Challenges of NFV Why attaining line speed for NFV is challenging (same per-packet time budgets as above). How can the NFV execution target (a multi-core server) support such a tight per-packet time budget?

7 NFV execution target : A Typical Multi-core Server Architecture
Intel(R) Xeon(R) Platinum 8168 CPU. Let's have a closer look at the multi-core server architecture, taking the Intel Xeon 8168 CPU as an example. The CPU has two NUMA nodes, and each node has 24 CPU cores. Each core has a dedicated L1/L2 cache, and the cores on a node share an L3 cache. DRAM and the NICs are connected to the CPU cores through uncore resources on the chip, such as the integrated memory controller and DDIO; note that DDIO lets the NIC deliver packets directly into the cache. The two NUMA nodes are connected by the Ultra Path Interconnect (UPI).
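To see this topology on a concrete machine, here is a small libnuma sketch (my illustration, not part of the talk) that reports which NUMA node each logical CPU belongs to; this is the information an NFV framework needs when pinning NFs and NIC queues:

```c
/* Build with: gcc -o topo topo.c -lnuma   (requires libnuma development headers) */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    int cpus  = numa_num_configured_cpus();
    printf("%d NUMA node(s), %d logical CPU(s)\n", nodes, cpus);

    /* Map each logical CPU to its NUMA node; an NFV framework would use this
     * to keep an NF, its packet buffers, and its NIC queue on the same node
     * and avoid the UPI/NUMA penalty discussed on the following slides. */
    for (int cpu = 0; cpu < cpus; cpu++)
        printf("cpu %3d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}
```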

8 Memory hierarchy of the NUMA Server
Intel(R) Xeon(R) Platinum 8168 CPU memory hierarchy:
  L1: 4 cycles (~1ns), 32KB
  L2: 14 cycles (~4ns), 1MB
  Local L3/LLC: ~44-68 cycles (~13-20ns), 33MB
  Local DRAM: ~250 cycles (~70ns), 192GB
For each CPU core in our server, clocked at 3.4GHz, accessing the dedicated L1/L2 cache is fast, about 1ns and 4ns respectively; however, these caches are small, only 32KB for L1 and 1MB for L2. The size grows to 33MB for the last-level L3 cache, but the latency also increases, up to about 20ns. For local DRAM, the latency further increases to about 70ns, with a capacity of hundreds of GB.

9 Memory hierarchy of the NUMA Server
Intel(R) Xeon(R) Platinum 8168 CPU memory hierarchy:
  L1: 4 cycles (~1ns), 32KB
  L2: 14 cycles (~4ns), 1MB
  Local L3/LLC: ~44-68 cycles (~13-20ns), 33MB
  Local DRAM: ~250 cycles (~70ns), 192GB
For inter-core communication, data is transferred through the shared L3 cache or DRAM, so the inter-core transfer overhead is at least the local L3 access latency and can be as high as the DRAM access latency.
Minimal inter-core communication overhead: the local L3 access latency.

10 Memory hierarchy of the NUMA Server
Intel(R) Xeon(R) Platinum 8168 CPU memory hierarchy:
  L1: 4 cycles (~1ns), 32KB
  L2: 14 cycles (~4ns), 1MB
  Local L3/LLC: ~44-68 cycles (~13-20ns), 33MB; Remote L3: ~40ns (NUMA penalty)
  Local DRAM: ~250 cycles (~70ns), 192GB; Remote DRAM: ~125ns (NUMA penalty)
Remote access costs the local latency plus the NUMA penalty. Both inter-core and inter-NUMA communication are expensive.

11 Memory hierarchy of the NUMA Server
Per-packet time budget for 64-byte frames: 67.2ns at 10Gbps, 16.8ns at 40Gbps, 6.7ns (~23 cycles) at 100Gbps, 1.7ns at 400Gbps.
Intel(R) Xeon(R) Platinum 8168 CPU memory hierarchy:
  L1: 4 cycles (~1ns), 32KB
  L2: 14 cycles (~4ns), 1MB
  Local L3/LLC: ~44-68 cycles (~13-20ns), 33MB; Remote L3: ~40ns (NUMA penalty)
  Local DRAM: ~250 cycles (~70ns), 192GB; Remote DRAM: ~125ns (NUMA penalty)
Ensuring that most NF operations are L1/L2 bound is essential for 100Gbps line speed!

12 NF Operations An example NF: Network Monitor (NM)
NM maintains a per-host (src_ip) counter and updates it for every packet. Two types of NF operations: (1) packet operations.

13 NF Operations An example NF: Network Monitor (NM)
NM maintains a per-host (src_ip) counter and updates it for every packet. Two types of NF operations: (1) packet operations; (2) state operations, here the counter update.
Maximum counter-update operations per second, by counter state location:
  L1:     850 M
  L2:     242 M
  L3/LLC:  62 M
  DRAM:    14 M
At 100Gbps with 64-byte frames, an NF must handle roughly 149 M packets, and hence counter updates, per second (100Gbps divided by 672 on-wire bits per minimum-size frame), so most state should be packed into the L1/L2 cache to sustain the 100Gbps line rate.
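To make the two operation types concrete, here is a minimal, self-contained C sketch of a Network-Monitor-style NF (my illustration, not the authors' code): the packet operation parses the IPv4 source address out of a raw Ethernet frame, and the state operation updates a per-host counter kept in a small hash table sized to stay cache-resident:

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define NM_BUCKETS 1024                 /* small table: the point is to stay in L1/L2 */

struct nm_entry { uint32_t src_ip; uint64_t pkts; };
static struct nm_entry nm_table[NM_BUCKETS];

/* Packet operation: pull the IPv4 source address out of an Ethernet frame.
 * Assumes an untagged IPv4 frame (source IP starts at byte offset 26). */
static uint32_t parse_src_ip(const uint8_t *frame, size_t len) {
    uint32_t ip;
    if (len < 34) return 0;
    memcpy(&ip, frame + 26, sizeof(ip));
    return ip;
}

/* State operation: update the per-host counter (open addressing, linear probing). */
static void nm_count(uint32_t src_ip) {
    uint32_t h = (src_ip * 2654435761u) >> 22;          /* hash into 1024 slots */
    for (int i = 0; i < NM_BUCKETS; i++) {
        struct nm_entry *e = &nm_table[(h + i) & (NM_BUCKETS - 1)];
        if (e->pkts == 0 || e->src_ip == src_ip) {      /* empty or matching slot */
            e->src_ip = src_ip;
            e->pkts++;
            return;
        }
    }
    /* table full: a real NF would evict or resize here */
}

int main(void) {
    uint8_t frame[64] = {0};
    frame[26] = 10; frame[29] = 1;                      /* source IP 10.0.0.1 */
    for (int i = 0; i < 5; i++)
        nm_count(parse_src_ip(frame, sizeof(frame)));

    /* Print non-empty counters (byte order of the printout assumes a little-endian host). */
    for (int b = 0; b < NM_BUCKETS; b++)
        if (nm_table[b].pkts)
            printf("host %u.%u.%u.%u -> %llu packets\n",
                   nm_table[b].src_ip & 0xff, (nm_table[b].src_ip >> 8) & 0xff,
                   (nm_table[b].src_ip >> 16) & 0xff, (nm_table[b].src_ip >> 24) & 0xff,
                   (unsigned long long)nm_table[b].pkts);
    return 0;
}
```

The table sizes and layout here are illustrative; the slide's point is that whichever structure holds the counters must fit in L1/L2 to keep each update within the per-packet budget.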

14 SFC and NFV Execution Models
NFs are chained together as a service function chain (SFC); the slide shows an example SFC. Two existing NFV execution models: Run-to-Completion (RTC) and Pipeline (PL), as adopted by systems such as [NetBricks, OSDI'16], [Metron, NSDI'18], [NetVM, NSDI'14], [ClickOS, NSDI'14], [E2, SOSP'15], and others.

15 SFC and NFV Execution Models
NFs are chained together as an SFC (example SFC on the slide). The two existing NFV execution models, Run-to-Completion (RTC) and Pipeline (PL) ([NetBricks, OSDI'16], [Metron, NSDI'18], [NetVM, NSDI'14], [ClickOS, NSDI'14], [E2, SOSP'15], ...), differ in per-NF state cache locality and inter-core transfer overhead.

16 SFC and NFV Execution Models
NFs are chained together as an SFC (example SFC on the slide). The two existing NFV execution models, Run-to-Completion (RTC) and Pipeline (PL) ([NetBricks, OSDI'16], [Metron, NSDI'18], [NetVM, NSDI'14], [ClickOS, NSDI'14], [E2, SOSP'15], ...), differ in per-NF state cache locality, inter-core transfer overhead, and flexibility for scaling; a structural sketch of the two models follows below.
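As an illustration of the structural difference (my sketch, using the example NFs from the background slides in an assumed order, not the paper's code): in RTC one core runs the whole chain on each packet, while in PL each NF runs on its own core and packets are handed over through inter-core queues. The sketch shows the RTC loop and notes, in comments, where PL would insert the per-NF queues:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* A toy packet descriptor; a real system would use something like DPDK's rte_mbuf. */
struct pkt { uint8_t data[64]; size_t len; };

/* Toy NFs standing in for ACL -> NM -> LB -> L3FWD. Each returns 0 to drop. */
static int acl(struct pkt *p)   { return p->len >= 34; }      /* drop runt frames */
static int nm(struct pkt *p)    { (void)p; return 1; }        /* count per host   */
static int lb(struct pkt *p)    { (void)p; return 1; }        /* pick a backend   */
static int l3fwd(struct pkt *p) { (void)p; return 1; }        /* set output port  */

typedef int (*nf_fn)(struct pkt *);
static nf_fn chain[] = { acl, nm, lb, l3fwd };

/* Run-to-Completion: one core executes the whole chain per packet, so packet
 * data stays in that core's L1/L2 and there is no inter-core handoff.
 * In the Pipeline model, each nf_fn would instead run on its own core, and the
 * call below would become an enqueue onto a ring toward the next NF's core,
 * paying at least one L3-latency transfer per hop. */
static void rtc_process(struct pkt *p) {
    for (size_t i = 0; i < sizeof(chain) / sizeof(chain[0]); i++)
        if (!chain[i](p))
            return;                                           /* packet dropped */
    /* transmit p here */
}

int main(void) {
    struct pkt p = { .len = 64 };
    rtc_process(&p);
    printf("processed one packet through a %zu-NF chain (RTC)\n",
           sizeof(chain) / sizeof(chain[0]));
    return 0;
}
```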

17 Evaluation of NFV Execution Models
PL vs. RTC: which better delivers maximum 100Gbps line speed on commodity multi-core servers together with the scalability and flexibility afforded by software? Based on the previous understanding of the two models, it is hard to simply say which is the better choice for a given service function chain. To better understand their pros and cons, let's evaluate the performance of both models on a real testbed.

18 Performance: Testbed 18 RTC instances using 18 cores vs. 6 PL instances using 18 cores
Our testbed contains two servers: one runs the example SFC, and the other acts as the traffic generator running TRex. Both servers are equipped with an Intel Xeon Platinum 8168 CPU and a Mellanox ConnectX-5 100G NIC. The generator offers traffic at 100Gbps, and we measure how many Gbps the SFC server sustains. For RTC, we run 18 instances of the chain on 18 cores, one core per RTC instance. For PL, we use the same 18 cores to support 6 PL instances, 3 cores per instance.

19 Performance: RTC vs. PL, both using 18 cores. All NF state is small enough to be packed into the L1/L2 cache. Key observations: for small frame sizes (64B-512B), RTC is × faster than PL; as the frame size increases, both RTC and PL reach the 100Gbps line rate. RTC suffers no inter-core transfer overhead.

20 Inter-core transfer overhead
The Intel VTune tool provides “precise” analysis based on hardware events. We measure the NFs' per-packet access cycles under the two models:
  Run-To-Completion (RTC): 8.3 / 7.5 cycles
  Pipeline (PL):           75.8 / 59.2 cycles
  Inter-core transfer overhead: ~68 / ~52 cycles
This overhead is consistent with the local L3 access latency (~13-20ns, i.e. roughly 44-68 cycles at 3.4GHz).
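VTune's event-based analysis is what the talk relies on; as a rough stand-alone alternative (my sketch, not the authors' methodology), one can estimate cache access latency in cycles with a dependent pointer-chasing loop timed via the TSC. A working set larger than L2 (1MB here) mostly lands in L3 and should give numbers in the tens-of-cycles range quoted above:

```c
/* Build: gcc -O2 -o chase chase.c   (x86-only: uses the TSC) */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>                               /* __rdtsc() */

#define N     (4u * 1024 * 1024 / sizeof(uint32_t))  /* 4MB working set: past L2, mostly L3 */
#define ITERS (32u * 1024 * 1024)

int main(void) {
    uint32_t *next = malloc(N * sizeof(uint32_t));
    if (!next) return 1;

    /* Sattolo's algorithm builds a single-cycle random permutation, so the chain
     * of dependent loads visits the whole working set and the hardware
     * prefetcher cannot hide the access latency. */
    for (uint32_t i = 0; i < N; i++) next[i] = i;
    uint64_t seed = 88172645463325252ULL;
    for (uint32_t i = N - 1; i > 0; i--) {
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
        uint32_t j = (uint32_t)(seed >> 33) % i;     /* j < i keeps it one big cycle */
        uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    uint32_t p = 0;
    uint64_t start = __rdtsc();
    for (uint32_t i = 0; i < ITERS; i++)
        p = next[p];                                 /* each load depends on the last */
    uint64_t end = __rdtsc();

    /* Caveat: TSC ticks can differ slightly from core cycles under turbo/DVFS. */
    printf("~%.1f TSC ticks per dependent load (checksum %u)\n",
           (double)(end - start) / ITERS, p);
    free(next);
    return 0;
}
```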

21 Does RTC always win?

22 Scalability and flexibility : RTC and PL
The traffic has an 'elephant' host, and we want to scale out the bottleneck NF in the SFC. The slide shows an example SFC and the ideal scale-out when the LB becomes the bottleneck. How should we choose between the two models?

23 Scalability and flexibility : RTC and PL
Issues when scaling out under the RTC model: the traffic has an 'elephant' host, and each core has to pack the state for all NFs. (The slide contrasts the ideal scale-out of the SFC with scaling under RTC.) How do we fit all that state into the L1/L2 cache?

24 Scalability and flexibility : RTC and PL
Issues when scaling out under the RTC model: the traffic has an 'elephant' host, and state is shared across the NM instances. How do we fit the state into the L1/L2 cache, and how do we minimize the side effects on NM (inter-core synchronization for the 'elephant' host counter)?

25 Scalability and flexibility : RTC and PL
If we scale out under the PL model instead: PL provides finer-grained scalability, which eliminates shared state and keeps cache locality (the state fits into L1/L2). (The slide compares scaling of the RTC model with scaling of the PL model.)

26 Scalability and flexibility : RTC and PL
Performance evaluation: both RTC and PL run on 20 cores, traffic from 1k 'elephant' hosts is evenly steered, and frame sizes range from 64B to 1518B. In this case PL achieves better performance than RTC! RTC can suffer more overhead than the inter-core transfer in PL, because PL maintains better cache locality: under RTC, inter-core synchronization on the 'elephant' host counters makes access to the shared counters L3/DRAM bound (a simplified sketch of this contention follows below).
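To illustrate why the shared 'elephant' counters hurt RTC-style scaling, here is a simplified pthread sketch of the contention effect (my illustration, not the paper's benchmark): several threads hammering one shared atomic counter bounce its cache line between cores, whereas giving each thread its own counter, which is roughly what PL-style scaling does for the NM state, keeps the updates core-local:

```c
/* Build: gcc -O2 -pthread -o counters counters.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define UPDATES  (20 * 1000 * 1000)

static atomic_ulong shared_counter;                      /* one cache line shared by all cores */

struct local_slot { atomic_ulong cnt; char pad[56]; };   /* pad to 64B to avoid false sharing  */
static struct local_slot local_counters[NTHREADS];

static void *bump_shared(void *arg) {
    (void)arg;
    for (long i = 0; i < UPDATES; i++)
        atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
    return NULL;
}

static void *bump_local(void *arg) {
    struct local_slot *slot = arg;
    for (long i = 0; i < UPDATES; i++)
        atomic_fetch_add_explicit(&slot->cnt, 1, memory_order_relaxed);
    return NULL;
}

static double run(void *(*fn)(void *), int use_local) {
    pthread_t th[NTHREADS];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, fn, use_local ? (void *)&local_counters[i] : NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main(void) {
    double t_shared = run(bump_shared, 0);   /* RTC-like scaling: all cores share one counter */
    double t_local  = run(bump_local, 1);    /* PL-like scaling: one counter per core         */
    printf("shared counter:   %.2fs for %d x %d updates\n", t_shared, NTHREADS, UPDATES);
    printf("per-core counter: %.2fs for %d x %d updates\n", t_local,  NTHREADS, UPDATES);
    return 0;
}
```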

27 Summary
We are the first to take an in-depth look at NFV execution models for 100Gbps line-speed SFC packet processing.
An NFV system needs to take the NUMA architecture into consideration to achieve the best performance, scalability, and flexibility.
Both the RTC and PL models have pros and cons: RTC generally performs better than PL because it has no inter-core transfer overhead, while PL offers better scalability and flexibility when scaling an SFC out.
A novel execution model that combines the strengths of both RTC and PL is promising. See the paper for more details.

28 Thanks for your attention! Any Questions?
Thanks for your attention! I’d like to take any questions.

