1 HiPINEB 2016: Exploring Low-latency Interconnect for Scaling Out Software Routers
Sangwook Ma, Joongi Kim, Sue Moon (School of Computing, KAIST)
Notes: Hello, I am the second author. The majority of this work was done by Sangwook, advised by Sue Moon.

2 Motivations for Software Router
Packet processing on commodity x86 servers: a low-cost alternative to HW routers.
+ Programmability & cost-effectiveness
- Lower performance than HW routers
Commercial products: Vyatta (Brocade), 1000V (Cisco)
Well-known principles for high performance: batching, pipelining, and parallelization

3 Scaling Out Software Routers
Why scale out? Because single boxes have limited port density as a router (e.g., 40 PCIe lanes for the latest single Xeon CPU).
Prior work: RouteBricks [SOSP09], ScaleBricks [SIGCOMM15], E2 [SOSP15]
Interconnect medium: Ethernet; topology: full mesh / central switch
Notes: Single-box limitation. Two prior works use an isolated interconnect: RouteBricks focuses on parallelism and per-server capacity planning, while ScaleBricks focuses on FIB optimization. E2 mixes the interconnect with external links, which limits design choices for the interconnect.

4 Our Problem: Latency Penalty
Two factors of latency increase: multiple hops among router nodes, and aggressive I/O and computation batching.
Our goal: allow maximum flexibility in future router cluster designs by minimizing interconnect overheads.
Notes: The main problem is the increase in latency, with two major sources: multiple hops and aggressive batching. The number of hops depends on the topology, and there is no single definitive design, so we maximize design flexibility by minimizing per-hop latency. Batching is essential, so we do it in a better way.

5 Our Solutions
RDMA (Remote DMA) as the interconnect medium: needs to keep throughput high.
Hardware-assisted I/O batching: offers lower latency than software-side batching and enhances RDMA throughput with small packets.

6 RDMA as Interconnect
It is an unexplored design choice for routers. All external connections need to be Ethernet-compatible, but scaling out opens a new design space: the interconnect.
RDMA provides low latency and high throughput. It reduces the burden on host CPUs by offloading most network-stack functionality to hardware.
Notes: "Unprecedented" changed to "unexplored" (a brief animation emphasizing that we are the first). Mention that HW-assisted batching is also supported in RDMA, linking to the next slide.

7 Hardware-assisted Batching
NICs often provide HW-based segmentation: in Ethernet, for jumbo frames that do not fit in a single page; in RDMA, for fast access to remote pages.
Batching reduces per-packet overheads: it saves the bandwidth spent on repeated protocol headers and the computation spent parsing and generating them.

8 Our Contributions
We compare throughput & latency of:
Combinations of different RoCE transport/operation modes
RoCE (RDMA over Converged Ethernet) vs. Ethernet
Result highlights:
In RoCE, the UC transport type and SEND/RECV operations offer the maximum performance.
RoCE latency is consistently lower than Ethernet.
RoCE throughput is lower than Ethernet for small packets.
HW-assisted batching improves RoCE throughput for small packets to be comparable to Ethernet.
Check out the paper for details.

9 Experiment Setup
Packet generator and packet forwarder: two commodity servers (Intel Xeon 2.6 GHz, 32 GB RAM), each with RDMA-capable NICs (Mellanox ConnectX-3, 40 Gbps per port), connected through a switch (Mellanox SX1036).
Software stack: Ubuntu / kernel 3.16 / Mellanox OFED / Intel DPDK 2.1
Notes: Both the packet generator and the packet forwarder use commodity server hardware and are equipped with two RNICs each. A standard software stack is used. Only a single core is used, to isolate performance from synchronization artifacts.

10 Latency Measurement
For both RoCE & Ethernet, latency is measured at API-level granularity.
1-hop latency = ( (T4 – T1) – (T3 – T2) ) / 2
(T4 – T1): round-trip time between TX/RX API calls in the sender
(T3 – T2): elapsed time between RX/TX API calls in the receiver
Notes: The two stacks expose different APIs, so a fair comparison requires API-level timestamps. Our one-hop latency is half of the difference between the time spent in the sender and the time spent in the receiver; it includes driver runtime, NIC processing time, and physical transmission delay.
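Not from the paper, just a minimal sketch of the computation (assuming both hosts take nanosecond timestamps with clock_gettime around the TX/RX API calls; the helper names are ours):

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: nanosecond timestamp from a monotonic clock. */
static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* One-hop latency following the slide's definition:
 * ((T4 - T1) - (T3 - T2)) / 2, where T1/T4 bracket the TX/RX API calls
 * in the sender and T2/T3 bracket the RX/TX API calls in the receiver. */
static inline int64_t one_hop_latency_ns(uint64_t t1, uint64_t t2,
                                         uint64_t t3, uint64_t t4)
{
    int64_t sender_rtt    = (int64_t)(t4 - t1); /* round trip seen by sender  */
    int64_t receiver_time = (int64_t)(t3 - t2); /* time spent in the receiver */
    return (sender_rtt - receiver_time) / 2;
}
```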

11 Throughput Measurement
Bidirectional throughput over 40GbE links, with protocol overhead included:
Ethernet: 14 bytes / packet
RDMA over Ethernet: 72 bytes / message
Packet sizes: 64 to 1500 bytes (typical network traffic for routers)
Throughput: 10-second average
Notes: We measure bidirectional throughput over 40 Gbps links to check the maximum hardware capacity, so protocol overhead is included. Packet sizes cover typical router traffic. Throughput is averaged over 10 seconds.
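For intuition only, a toy model (our own simplification, not the paper's methodology) of how the per-packet overheads listed above bound the payload share of a 40 Gbps link; the paper also attributes the RoCE gap to the NIC's lower message-processing rate, which this model ignores:

```c
#include <stdio.h>

/* Payload throughput (Gbps) on a wire-limited link when every payload of
 * `payload_bytes` carries `overhead_bytes` of protocol headers. Only the
 * header overheads from the slide are counted (no preamble/IFG/CRC). */
static double payload_gbps(double line_rate_gbps, int payload_bytes, int overhead_bytes)
{
    double wire_bytes = payload_bytes + overhead_bytes;
    return line_rate_gbps * payload_bytes / wire_bytes;
}

int main(void)
{
    /* 64-byte packets: Ethernet's 14-byte header vs. RoCE's 72-byte overhead. */
    printf("Ethernet,   64B: %.1f Gbps payload\n", payload_gbps(40.0, 64, 14));
    printf("RoCE,       64B: %.1f Gbps payload\n", payload_gbps(40.0, 64, 72));
    /* 1500-byte packets: the relative overhead shrinks for both. */
    printf("Ethernet, 1500B: %.1f Gbps payload\n", payload_gbps(40.0, 1500, 14));
    printf("RoCE,     1500B: %.1f Gbps payload\n", payload_gbps(40.0, 1500, 72));
    return 0;
}
```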

12 RoCE vs. Ethernet: 1-hop Latency (w/o batching)
RoCE: under 3 usec in all cases
Ethernet: without I/O batching, up to 10 usec at MTU-sized packets
All values are medians; deviations at the 10%/90% quantiles are within 0.5 usec for RoCE.
Notes: 10 usec may seem low, but I/O batching is essential, as the next slide shows. Sampling method: for Ethernet, average latency was measured with pspgen; for RoCE, we sampled with a program we wrote ourselves. Unfortunately, we did not collect full CDF/quantile data for the Ethernet case.

13 Impact of I/O Batching in Ethernet
Packet size fixed to 1500 bytes.
Without batching, latency is 10.3 usec; as the batch size increases, latency rises to 73 usec, nearly 30x the 2.5 usec of RoCE.
Notes: Low latency is critical for horizontal scaling. As a reference point, a Cisco whitepaper keeps a 40 Gbps product under 27 usec.

14 RoCE vs. Ethernet: Throughput
Ethernet performs better than RoCE at all packet sizes, and the throughput gap widens for smaller packets.
Notes: Bidirectional throughput. At all packet sizes, Ethernet throughput exceeds RoCE throughput; the gap is larger below 512 bytes. RoCE has more overhead: a 72-byte protocol header per message, and the NIC's processing speed is lower than the CPU's.

15 Mixed Results of RoCE for Routers
+ RDMA keeps latency under 3 usec at all packet sizes, up to 30x lower than Ethernet under the same conditions.
- RDMA throughput is below Ethernet throughput when the packet size is ≤ 1500 bytes.
Our breakthrough: exploit HW-assisted batching!
Notes: The good part and the bad part; we tackle this issue with HW batching.

16 Potential Benefits of RoCE Batching
With packets ≥ 1500 bytes, RoCE achieves line rate and keeps latency under 17 usec.
Notes: The 17 usec is for 32K RDMA messages; it would be good to compare this against Ethernet with 20 x 1.5K packets... (Keunhong)

17 How HW-assisted Batching Works
[Figure: on the sender host, the application (bypassing the OS network stack) hands multiple Ethernet packets to the RoCE NIC, which combines them into a single RoCE message with a RoCE header; on the receiver host, the RoCE NIC splits the message back into separate Ethernet packets. Ethernet NICs connect each host to the external network; the RoCE NICs form the interconnect.]
Notes: "HW-assisted" means the NIC performs the split and reassembly of messages using scatter/gather. The figure shows how: the sender passes pointers to multiple messages to the NIC, the NIC assembles them, they are transferred as a single message, and the receiving NIC splits them into separate packets. We primarily focus on the interconnect; the greyed-out parts are the edge nodes of a real router cluster and are not included in our experiment. Be prepared for a question about why we did not compare against Ethernet scatter/gather. (Keunhong) What about the message format?
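To make the mechanism concrete, a minimal sketch in standard verbs C (not the authors' code; it assumes a queue pair `qp` and a registered memory region `mr` set up elsewhere, and the batch-size cap of 32 is illustrative) of posting several Ethernet frames as one RoCE SEND via a gather list, so the NIC assembles the combined message:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post `n` packet buffers as ONE RoCE SEND work request: the sg_list lets the
 * NIC gather the separate frames into a single message on the wire. Assumes
 * qp, mr, and the buffers were set up elsewhere; n must not exceed the QP's
 * max_send_sge. Returns 0 on success. */
int post_batched_send(struct ibv_qp *qp, struct ibv_mr *mr,
                      void *pkt[], uint32_t pkt_len[], int n)
{
    struct ibv_sge sge[32];
    struct ibv_send_wr wr, *bad_wr = NULL;

    if (n < 1 || n > 32)
        return -1;

    for (int i = 0; i < n; i++) {
        sge[i].addr   = (uintptr_t)pkt[i];   /* start of the i-th Ethernet frame */
        sge[i].length = pkt_len[i];
        sge[i].lkey   = mr->lkey;            /* all buffers in one registered MR */
    }

    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = sge;
    wr.num_sge    = n;                       /* gather n frames into one message */
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;       /* request a completion entry */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

On the receive side, the mirror step happens through the RECV work request's own scatter list: if the receiver posts per-packet SGEs of matching sizes, the NIC scatters the arriving message back across the separate packet buffers.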

18 HW-assisted Batching: Throughput
3.7x to 4.8x throughput improvement for small packets; the generally best batch size is 16. Throughput is higher than or close to Ethernet with I/O batching.
Notes: Blue bars are RoCE with batch sizes from 2 to 32; purple bars are Ethernet with batch size 32. Line-rate throughput is reached when the packet size is ≥ 512 bytes; the 3.7x to 4.8x increase is for packet sizes ≤ 256 bytes; maximum throughput occurs at batch size 16. Think about how to answer if a question about packet-size mixes comes up.

19 HW-assisted Batching: Latency
Latency is 5.4x lower than Ethernet at the same batch size and increases linearly as the batch/packet size increases.
(Deviations at the 10%/90% quantiles are within 0.5 usec of the median.)
Notes: The maximum latency, at the largest packet size with batch size 32, is still 5.4x lower than Ethernet's 73 usec under the same conditions.

20 Summary & Conclusion
RDMA is a valid alternative as an interconnect for scaled-out SW routers: it reduces I/O latency by up to 30x compared to Ethernet.
The challenge is its low throughput at packet sizes ≤ 1500 bytes.
We exploit HW-assisted batching to enhance throughput: it batches multiple Ethernet packets into a single RoCE message.
Our scheme achieves throughput higher than or close to Ethernet while keeping 1-hop latency under 14 usec.

21 Q & A
This concludes my presentation. Thank you.

22 Transfer Operations and Connection Types
Four types of RDMA transfer operations: READ, WRITE, SEND, and RECV. We use SEND & RECV, which are more suitable for latency-critical applications like packet processing.
Three transport types for RDMA connections: RC (Reliable Connection), UC (Unreliable Connection), and UD (Unreliable Datagram). We choose the UC type, which shows the highest throughput of the three.

23 4 Types of RDMA Transfer Operations
READ & WRITE operations are one-sided: the receive side's CPU is unaware of the transfer. READ "pulls" data from remote memory and WRITE "pushes" data into remote memory.
SEND & RECV operations are two-sided: the CPUs of both sides are involved. The sender sends data using SEND, and the receiver posts RECV to receive it.
We use SEND & RECV, which are more suitable for latency-critical applications like packet processing, as sketched below.
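For reference, a minimal sketch of the two-sided SEND/RECV pattern with the standard verbs API (not the paper's code; the queue pair `qp`, completion queue `cq`, and registered buffer `buf`/`mr` are assumed to exist, and error handling is trimmed):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Receiver: post a buffer so the NIC has somewhere to place an incoming SEND. */
int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list = &sge;
    wr.num_sge = 1;
    return ibv_post_recv(qp, &wr, &bad_wr);
}

/* Sender: SEND the buffer; the data lands in whatever RECV the peer posted. */
int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Either side: reap one completion to learn that the operation finished. */
int wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* busy-poll; a real router would bound this */
    } while (n == 0);

    return (n > 0 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```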

24 3 Types of RDMA Connections
Three transport types for RDMA connections: RC (Reliable Connection), UC (Unreliable Connection), and UD (Unreliable Datagram).
The connected types (RC & UC) support message sizes up to 2 GB but require a fixed sender-receiver connection.
RC's ACK/NACK protocol lets it guarantee lossless transfer but consumes link bandwidth.
The UD type does not require a fixed connection, but its message size is limited to the MTU and it adds a 40-byte protocol overhead.
The UC type shows the highest throughput of the three, and we use it in this work; a sketch of creating a UC queue pair follows.
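For reference, a minimal sketch (standard verbs API; the protection domain `pd` and completion queue `cq` are assumed to be created earlier, and the capability numbers are illustrative rather than the paper's configuration) of creating a queue pair with the UC transport type:

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Create an Unreliable Connection (UC) queue pair on an existing protection
 * domain and completion queue. Returns NULL on failure. */
struct ibv_qp *create_uc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_UC;        /* Unreliable Connection transport   */
    attr.cap.max_send_wr  = 256;      /* outstanding send work requests    */
    attr.cap.max_recv_wr  = 256;      /* outstanding recv work requests    */
    attr.cap.max_send_sge = 32;       /* room for scatter/gather batching  */
    attr.cap.max_recv_sge = 32;
    return ibv_create_qp(pd, &attr);
}
```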

25 Related Work
Implementation of distributed key-value stores: Pilaf [ATC '13], HERD [SIGCOMM '14], FaRM [NSDI '14]
Acceleration of existing applications: MPI [ICS '03], HBase [IPDPS '12], HDFS [SC '12], Memcached [ICPP '11]. These replace the socket interface with RDMA transfer operations.
RDMA-like interconnects for rack-scale computing: Scale-out NUMA [ASPLOS '14], R2C2 [SIGCOMM '15], Marlin [ANCS '14]
Notes: placed at the end.

26 Future Work
Examine the effect of the number of RDMA connections on performance.
Measure throughput and latency using real traffic traces.
Implement a scaled-out SW router prototype using an RDMA interconnect: a cluster with Ethernet ports for the external interface and RoCE ports for the interconnect.
Notes: placed at the end.

27 etc.
The "Barcelona" icon in the title slide is by Adam Whitcroft, sponsored by OffScreen.

