Exploring Low-latency Interconnect for Scaling Out Software Routers
HiPINEB 2016, March 12, 2016
Sangwook Ma, Joongi Kim, Sue Moon (School of Computing, KAIST)
Speaker notes: Hello, I am the second author. The majority of this work was done by Sangwook, advised by Sue Moon.

Motivations for Software Routers
Packet processing on commodity x86 servers is a low-cost alternative to hardware routers.
Pros: programmability and cost-effectiveness. Cons: lower performance than hardware routers.
Commercial products: Vyatta (Brocade), 1000V (Cisco).
Well-known principles for high performance: batching, pipelining, and parallelization.

Scaling Out Software Routers
Why scale out? Because single boxes have limited port density as a router (e.g., 40 PCIe lanes for the latest single Xeon CPU).
Prior work: RouteBricks [SOSP09], ScaleBricks [SIGCOMM15], E2 [SOSP15].
Interconnect medium: Ethernet. Topology: full mesh or central switch.
Speaker notes: Single boxes are limited. Two known works use an isolated interconnect: RouteBricks focuses on parallelism and per-server capacity planning, while ScaleBricks focuses on optimizing the FIB. E2 mixes interconnect and external links, which limits the design choices for the interconnect.

Our Problem: Latency Penalty
Two factors increase latency: multiple hops among router nodes, and aggressive I/O and computation batching.
Our goal: allow maximum flexibility in future router cluster designs by minimizing interconnect overheads.
Speaker notes: The main problem is the increase in latency, with two major sources: multiple hops and aggressive batching. The number of hops depends on the topology, and there is no single definitive design, so we maximize flexibility in cluster designs by minimizing per-hop latency. Batching is essential, so we look for a better way to do it.

Our Solutions
RDMA (Remote DMA) as the interconnect medium; the challenge is to keep throughput high.
Hardware-assisted I/O batching: it offers lower latency than software-side batching and improves RDMA throughput for small packets.

RDMA as Interconnect
RDMA is an unexplored design choice for routers: all external connections need to remain Ethernet-compatible, but scaling out opens a new design space for the interconnect.
RDMA provides low latency and high throughput, and it reduces the burden on host CPUs by offloading most of the network stack to hardware.
Speaker notes: "Unprecedented" changed to "unexplored" (with a slight animation emphasizing that we are the first). Mention that RDMA also supports HW-assisted batching, to connect to the next slide.

Hardware-assisted Batching
NICs often provide hardware-based segmentation: in Ethernet, for jumbo frames that do not fit within a page; in RDMA, for fast access to remote pages.
Batching reduces per-packet overheads: it saves the bandwidth spent on repeated protocol headers and the computation spent parsing and generating them.

Our Contributions
We compare the throughput and latency of combinations of different RoCE transport/operation modes, and of RoCE (RDMA over Converged Ethernet) vs. Ethernet.
Result highlights: in RoCE, the UC transport type with SEND/RECV operations offers the maximum performance. RoCE latency is consistently lower than Ethernet's, but RoCE throughput is lower than Ethernet's for small packets. HW-assisted batching raises RoCE throughput for small packets to be comparable to Ethernet. Check out the paper for details.

Experiment Setup
Packet generator and packet forwarder: two commodity servers (Intel Xeon E5-2670v3 @ 2.6 GHz, 32 GB RAM), each equipped with RDMA-capable NICs (Mellanox ConnectX-3, 40 Gbps per port), connected through a Mellanox SX1036 switch.
Software stack: Ubuntu 14.04 / kernel 3.16 / Mellanox OFED 3.0.2 / Intel DPDK 2.1.
Speaker notes: Both the packet generator and the forwarder use commodity server hardware with two RNICs and a standard software stack. We use a single core only, to isolate performance from synchronization artifacts.

Latency Measurement
For both RoCE and Ethernet, we measure at API-level granularity.
1-hop latency = ( (T4 - T1) - (T3 - T2) ) / 2, where (T4 - T1) is the round-trip time between the TX/RX API calls on the sender and (T3 - T2) is the elapsed time between the RX/TX API calls on the receiver.
Speaker notes: The two sides use different APIs, so a fair comparison needs API-level timestamps. Our one-hop latency is half of the difference between the time spent in the sender and the time spent in the receiver; it includes driver runtime, NIC processing time, and physical transmission delay.
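
As a worked illustration of the formula, the small C program below computes the 1-hop latency from four API-level timestamps. Using clock_gettime(CLOCK_MONOTONIC) as the timestamp source and the sample values in main() are our assumptions; the slide only defines T1 through T4.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Current time in nanoseconds from a monotonic clock (our assumption
 * for the timestamp source; the slide only specifies API-level points). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* 1-hop latency = ((T4 - T1) - (T3 - T2)) / 2
 * T1: sender calls the TX API          T2: receiver's RX API returns
 * T3: receiver calls the TX API back   T4: sender's RX API returns    */
static double one_hop_latency_us(uint64_t t1, uint64_t t2,
                                 uint64_t t3, uint64_t t4)
{
    uint64_t sender_rtt_ns = t4 - t1;  /* time spent at the sender   */
    uint64_t receiver_ns   = t3 - t2;  /* time spent at the receiver */
    return (double)(sender_rtt_ns - receiver_ns) / 2.0 / 1000.0;
}

int main(void)
{
    /* Hypothetical timestamps: 10.3 us sender RTT, 4.9 us spent in the
     * receiver, which yields a 2.70 us one-hop latency. */
    uint64_t t1 = now_ns();
    uint64_t t2 = t1 + 3000;
    uint64_t t3 = t2 + 4900;
    uint64_t t4 = t1 + 10300;
    printf("1-hop latency: %.2f us\n", one_hop_latency_us(t1, t2, t3, t4));
    return 0;
}
```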

Throughput Measurement
Bidirectional throughput over 40 GbE, with protocol overhead included: 14 bytes per packet for Ethernet, 72 bytes per message for RDMA over Ethernet.
Packet sizes: 64 to 1500 bytes, typical of network traffic seen by routers.
Throughput is reported as a 10-second average.
Speaker notes: We measure bidirectional throughput over 40 Gbps links to check the maximum hardware capacity, which is why protocol overhead is included. The packet sizes reflect typical router traffic, and throughput is averaged over 10 seconds.
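
To show how the quoted overheads enter the throughput numbers, here is a tiny C helper that converts a payload size, the per-unit overhead from the slide (14 B per Ethernet packet, 72 B per RoCE message), and a transfer rate into on-wire Gbps. The packet rate and the assumption that the overhead is counted on top of the payload are ours, used only for illustration.

```c
#include <stdio.h>

/* On-wire Gbps for a stream of transfer units, counting the payload
 * plus the per-unit protocol overhead quoted on the slide. */
static double wire_gbps(double payload_bytes, double overhead_bytes,
                        double units_per_sec)
{
    return (payload_bytes + overhead_bytes) * 8.0 * units_per_sec / 1e9;
}

int main(void)
{
    /* Hypothetical rate: 5 million 64-byte transfers per second. */
    double rate = 5e6;
    printf("Ethernet (14 B/packet overhead): %.2f Gbps\n",
           wire_gbps(64.0, 14.0, rate));
    printf("RoCE     (72 B/message overhead): %.2f Gbps\n",
           wire_gbps(64.0, 72.0, rate));
    return 0;
}
```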

RoCE vs. Ethernet: 1-hop Latency (w/o batching)
RoCE stays below 3 usec in all cases; Ethernet without I/O batching reaches a maximum of 10 usec at MTU-sized packets. All values are medians; for RoCE, deviations at the 10%/90% quantiles are within 0.5 usec.
Speaker notes: 10 usec may seem low, but I/O batching is essential, which leads to the next slide. Sampling method: for Ethernet we measured average latency with pspgen, while for RoCE we sampled with a program we wrote ourselves. Unfortunately, we did not collect full CDF/quantile data for the Ethernet case.

Impact of I/O Batching in Ethernet
Packet size is fixed at 1500 bytes. Without batching, Ethernet latency is 10.3 usec; as the batch size grows, latency increases up to 73 usec, nearly 30x the 2.5 usec of RoCE.
Speaker notes: Latency increases with batch size, up to 73 usec, 30 times that of RoCE. This matters for horizontal scaling. As a reference point, a Cisco whitepaper states that their 40 Gbps product keeps latency under 27 usec.

RoCE vs. Ethernet: Throughput
Ethernet outperforms RoCE at all packet sizes, and the throughput gap widens for smaller packets.
Speaker notes: This is bidirectional throughput. Ethernet throughput exceeds RoCE throughput at every packet size, with a larger gap below 512 bytes. RoCE has more overhead: a 72-byte protocol header per RDMA message, plus lower processing speed on the NIC than on the CPU.

Mixed Results of RoCE for Routers
Good part: RDMA keeps latency under 3 usec at all packet sizes, up to 30x lower than Ethernet under the same conditions.
Bad part: RDMA throughput is below Ethernet throughput when the packet size is ≤ 1500 B.
Our breakthrough: exploit HW-assisted batching!
Speaker notes: We address this issue with HW batching.

Potential Benefits of RoCE Batching
With packets ≥ 1500 bytes, RoCE achieves line rate and keeps latency under 17 usec.
Speaker notes: 17 usec -> RDMA 32K = Ethernet 20x; it would be nice to compare with 1.5K as well... (근홍)

How HW-assisted Batching Works
[Figure: a sender host and a receiver host, each bypassing the OS network stack. Ethernet packets arriving from the external network are combined into a single RoCE message (with one RoCE header) on the interconnect, then split back into separate Ethernet packets at the receiver.]
HW-assisted means that the NIC performs the split and reassembly of messages using scatter/gather.
Speaker notes: The figure shows how the sender passes pointers to multiple messages to the NIC, the NIC assembles them, they are transferred as a single message, and the receiving NIC splits them back into separate packets. We primarily focus on the interconnect; the greyed-out parts are the edge nodes of a real router cluster and are not included in our experiment. Be prepared for a question about why we did not compare against Ethernet scatter/gather. (근홍) Message format?
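
To make the scatter/gather idea concrete, here is a minimal ibverbs sketch (not taken from the paper) of a sender handing the NIC pointers to several packet buffers so they are gathered into a single RoCE SEND on the wire. Queue-pair setup, memory registration, and completion handling are omitted; qp, bufs, lens, and lkey are assumed to exist, and the QP's max_send_sge must be at least the batch size.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post one SEND work request whose scatter/gather list points at
 * `npkts` separate packet buffers registered under the same lkey.
 * The RNIC gathers them into one RoCE message on the wire. */
static int post_batched_send(struct ibv_qp *qp, void **bufs,
                             const unsigned *lens, int npkts, uint32_t lkey)
{
    struct ibv_sge sge[32];            /* one SGE per packet buffer      */
    struct ibv_send_wr wr, *bad_wr = NULL;

    if (npkts > 32)
        return -1;                     /* stay within a fixed SGE cap    */

    for (int i = 0; i < npkts; i++) {
        sge[i].addr   = (uintptr_t)bufs[i];
        sge[i].length = lens[i];
        sge[i].lkey   = lkey;
    }

    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = sge;
    wr.num_sge    = npkts;             /* gather npkts buffers           */
    wr.opcode     = IBV_WR_SEND;       /* two-sided SEND operation       */
    wr.send_flags = IBV_SEND_SIGNALED; /* request a completion entry     */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```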

HW-assisted Batching: Throughput
3.7x to 4.8x throughput improvement for small packets; the generally best batch size is 16. Throughput is higher than or close to Ethernet with I/O batching.
Speaker notes: The blue bars are RoCE with batch sizes from 2 to 32, and the purple bars are Ethernet with batch size 32. RoCE reaches line rate when the packet size is ≥ 512 B and improves 3.7x to 4.8x when the packet size is ≤ 256 B, with maximum throughput at batch size 16. Also think about how to answer if a question about the packet size mix comes up.

HW-assisted Batching: Latency
Latency is 5.4x lower than Ethernet with the same batch size, and increases linearly as the batch size and packet size increase. (Deviations at the 10%/90% quantiles are within 0.5 usec from the median.)
Speaker notes: The maximum latency is 13.5 usec at 1500 bytes with batch size 32, still 5.4x lower than the 73 usec of Ethernet under the same conditions.

Summary & Conclusion
RDMA is a valid alternative as the interconnect of scaled-out software routers: it reduces I/O latency by up to 30x compared to Ethernet. The challenge is its low throughput at packet sizes ≤ 1500 bytes.
We exploit HW-assisted batching to enhance throughput by packing multiple Ethernet packets into a single RoCE message. Our scheme achieves throughput higher than or close to Ethernet while keeping 1-hop latency under 14 usec.

Q & A
Speaker notes: This concludes my presentation. Thank you.

Transfer Operations and Connection Types
RDMA offers four transfer operations: READ, WRITE, SEND, and RECV. We use SEND and RECV, which are more suitable for latency-critical applications like packet processing.
RDMA connections come in three transport types: RC (Reliable Connection), UC (Unreliable Connection), and UD (Unreliable Datagram). We choose UC, which shows the highest throughput of the three.

4 Types of RDMA Transfer Operations
READ and WRITE are one-sided operations: the receive side's CPU is unaware of the transfer. READ "pulls" data from remote memory and WRITE "pushes" data into remote memory.
SEND and RECV are two-sided operations: the CPUs of both sides are involved. The sender sends data using SEND, and the receiver posts RECVs to receive it.
We use SEND and RECV, which are more suitable for latency-critical applications like packet processing.
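
As a reference for the two-sided pattern, the sketch below shows a hypothetical receive side in ibverbs: it posts one receive buffer and busy-polls the completion queue for the matching SEND. The queue pair, completion queue, buffer, and lkey are assumed to be set up elsewhere; this illustrates the API flow under our own assumptions, not the paper's code.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a single receive buffer and busy-poll until an incoming SEND
 * lands in it. Returns the number of bytes received, or -1 on error. */
static int recv_one(struct ibv_qp *qp, struct ibv_cq *cq,
                    void *buf, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_recv_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list = &sge;
    wr.num_sge = 1;

    if (ibv_post_recv(qp, &wr, &bad_wr))
        return -1;

    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* busy-poll for low latency */
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;
    return (int)wc.byte_len;           /* bytes delivered by the SEND */
}
```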

3 Types of RDMA Connections
RDMA offers three transport types: RC (Reliable Connection), UC (Unreliable Connection), and UD (Unreliable Datagram).
The connected types (RC and UC) support message sizes up to 2 GB but require a fixed sender-receiver connection. The ACK/NACK protocol of RC lets it guarantee lossless transfer but consumes link bandwidth. UD does not require a fixed connection, but its message size is limited to the MTU and it adds 40 bytes of protocol overhead.
UC shows the highest throughput of the three, so we use it in this work.
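
Selecting the UC transport comes down to the qp_type passed at queue-pair creation. The following is a minimal sketch, assuming a protection domain and completion queue already exist; the capability numbers are illustrative, not values from the paper.

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Create an Unreliable Connection (UC) queue pair on an existing
 * protection domain and completion queue. Capacity values below are
 * illustrative and must respect the device's reported limits. */
static struct ibv_qp *create_uc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_UC;        /* unreliable connected transport */
    attr.cap.max_send_wr  = 256;      /* outstanding send requests      */
    attr.cap.max_recv_wr  = 256;      /* outstanding receive requests   */
    attr.cap.max_send_sge = 32;       /* gather entries per send WR     */
    attr.cap.max_recv_sge = 1;
    return ibv_create_qp(pd, &attr);  /* returns NULL on failure        */
}
```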

Related Work
Implementations of distributed key-value stores: Pilaf [ATC '13], HERD [SIGCOMM '14], FaRM [NSDI '14].
Acceleration of existing applications: MPI [ICS '03], HBase [IPDPS '12], HDFS [SC '12], Memcached [ICPP '11]. These works replace the socket interface with RDMA transfer operations.
RDMA-like interconnects for rack-scale computing: Scale-out NUMA [ASPLOS '14], R2C2 [SIGCOMM '15], Marlin [ANCS '14].
Speaker notes: Place this at the very end.

Future Work
Examine the effect of the number of RDMA connections on performance.
Measure throughput and latency using real traffic traces.
Implement a scaled-out software router prototype using an RDMA interconnect: a cluster with Ethernet ports for the external interface and RoCE ports for the interconnect.
Speaker notes: Place this at the very end.

etc. The "Barcelona" icon in the title slide is by Adam Whitcroft, sponsored by OffScreen.