NVMe™/TCP Development Status and a Case study of SPDK User Space Solution
2019 NVMe™ Annual Members Meeting and Developer Day, March 19, 2019
Sagi Grimberg, Lightbits Labs
Ben Walker and Ziye Yang, Intel

NVMe™/TCP Status
- TP ratified @ Nov 2018
- Linux kernel NVMe/TCP inclusion made v5.0
- Interoperability tested with vendors and SPDK
- Running in large-scale production environments (backported, though)
- Main TODOs:
  - TLS support
  - Connection termination rework
  - I/O polling (leverage .sk_busy_loop() for polling)
  - Various performance optimizations (mainly on the host driver)
  - A few minor specification wording issues to fix up

Performance: Interrupt Affinity
- In NVMe™ we pay close attention to steering interrupts to the application CPU core
- In TCP networking:
  - TX interrupts are usually steered to the submitting CPU core (XPS)
  - RX interrupt steering is determined by a hash of the 5-tuple, which is not local to the application CPU core
- But aRFS comes to the rescue!
  - The RPS mechanism is offloaded to the NIC
  - The NIC driver implements .ndo_rx_flow_steer (see the sketch below)
  - The RPS stack learns which CPU core processes the stream and teaches the HW with a dedicated steering rule
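
Below is a minimal sketch, assuming a hypothetical my_nic driver, of how a NIC plugs into aRFS; the .ndo_rx_flow_steer callback name and signature follow include/linux/netdevice.h, while the rule-programming helper is an illustrative stub.

/* Sketch only: the my_nic_* names are hypothetical; the ndo_rx_flow_steer
 * signature matches include/linux/netdevice.h (the field is available
 * when CONFIG_RFS_ACCEL is enabled). */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical helper that would program a HW 5-tuple steering rule. */
static int my_nic_add_steering_rule(struct net_device *dev,
				    const struct sk_buff *skb,
				    u16 rxq_index, u32 flow_id)
{
	/* Extract the 5-tuple from @skb and install an exact-match filter
	 * that directs this flow to RX queue @rxq_index. */
	return 0;
}

/* aRFS callback: invoked by the RPS/RFS core when it learns that the
 * socket for this flow is consumed on a CPU served by @rxq_index. */
static int my_nic_rx_flow_steer(struct net_device *dev,
				const struct sk_buff *skb,
				u16 rxq_index, u32 flow_id)
{
	return my_nic_add_steering_rule(dev, skb, rxq_index, flow_id);
}

static const struct net_device_ops my_nic_netdev_ops = {
	/* ...other ndo_* callbacks... */
	.ndo_rx_flow_steer = my_nic_rx_flow_steer,
};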

Canonical Latency Overhead Comparison
- The measurement tests the latency overhead for a QD=1 I/O operation
- NVMe™/TCP is faster than iSCSI but slower than NVMe/RDMA

Performance: Large Transfer Optimizations
- NVMe™ usually imposes minor CPU overhead for large I/O:
  - <= 8K (two pages): only assign 2 pointers
  - > 8K: set up a PRP list/SGL
- In TCP networking:
  - TX: large transfers involve higher overhead for TCP segmentation and copy
    - Solution: TCP Segmentation Offload (TSO) and .sendpage()
  - RX: large transfers involve higher overhead for more interrupts and copy
    - Solution: Generic Receive Offload (GRO) and adaptive interrupt moderation
- Still more overhead than PCIe, though...
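
The TX-side idea can be sketched as follows, assuming a hypothetical example_send_data_page() helper; kernel_sendpage() is the in-kernel socket API for page-based, copy-free transmission in the v5.x kernels discussed here, and MSG_MORE hints that more PDU data follows so the stack can build large TSO segments.

/* Illustrative only: the example_* names are assumptions, not driver code.
 * kernel_sendpage() queues a page for TX without copying it into the
 * socket buffer; the page is referenced, not copied. */
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/mm.h>

static int example_send_data_page(struct socket *sock, struct page *page,
				  int offset, size_t len, bool last_frag)
{
	int flags = MSG_DONTWAIT;

	if (!last_frag)
		flags |= MSG_MORE;	/* more of this PDU follows */

	return kernel_sendpage(sock, page, offset, len, flags);
}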

Throughput Comparison
- Single-threaded NVMe™/TCP achieves 2x better throughput
- NVMe/TCP scales to saturate 100Gb/s with 2-3 threads, whereas iSCSI is blocked

NVMe™/TCP Parallel Interface
- Each NVMe queue maps to a dedicated bidirectional TCP connection
- No controller-wide sequencing
- No controller-wide reassembly constraints

4K IOPs Scalability
- iSCSI is heavily serialized and cannot scale with the number of threads
- NVMe™/TCP scales very well, reaching over 2M 4K IOPs

Performance: Read vs. Write I/O Queue Separation
- A common problem with TCP/IP is head-of-queue (HOQ) blocking
  - For example, a small 4KB Read is blocked behind a large 1MB Write until the data transfer completes
- Linux supports separate queue mappings since v5.0: a Default queue map, a Read queue map, and a Poll queue map (see the sketch below)
- NVMe™/TCP leverages separate queue maps to eliminate HOQ blocking
- In the future, priority-based queue arbitration can reduce the impact even further
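
A rough sketch of how a v5.x fabrics driver can split its hardware contexts into Default/Read/Poll maps; the example_* names and queue counts are assumptions, not the actual nvme-tcp code, but HCTX_TYPE_* and blk_mq_map_queues() come from include/linux/blk-mq.h.

/* Sketch, loosely modeled on a v5.x blk-mq .map_queues callback. */
#include <linux/blk-mq.h>

static unsigned int nr_default_queues;	/* carry writes (and reads without a read map) */
static unsigned int nr_read_queues;	/* dedicated read queues */
static unsigned int nr_poll_queues;	/* interrupt-less polled queues */

static int example_map_queues(struct blk_mq_tag_set *set)
{
	/* Reads get their own map, so a 4KB read never queues behind a
	 * large write on the same TCP connection. */
	set->map[HCTX_TYPE_DEFAULT].nr_queues = nr_default_queues;
	set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;

	set->map[HCTX_TYPE_READ].nr_queues = nr_read_queues;
	set->map[HCTX_TYPE_READ].queue_offset = nr_default_queues;

	set->map[HCTX_TYPE_POLL].nr_queues = nr_poll_queues;
	set->map[HCTX_TYPE_POLL].queue_offset =
		nr_default_queues + nr_read_queues;

	/* Spread each map across the online CPUs. */
	blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
	blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
	blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]);

	return 0;
}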

Mixed Workloads Test
- Tests the impact of large Write I/O on Read latency:
  - 32 "readers" issuing synchronous READ I/O
  - 1 writer issuing 1MB Writes @ QD=16
- iSCSI latencies collapse in the presence of large Writes (heavy serialization over a single channel)
- NVMe™/TCP is very much on par with NVMe/RDMA

Commercial Performance
Software NVMe™/TCP controller performance (IOPs vs. latency)*
* Commercial single 2U NVM subsystem that implements RAID and compression, with 8 attached hosts

Commercial Performance – Mixed Workloads
Software NVMe™/TCP controller performance (IOPs vs. latency)*
* Commercial single 2U NVM subsystem that implements RAID and compression, with 8 attached hosts

Slab, sendpage and kernel hardening
- We never copy buffers on the NVMe™/TCP TX side (not even PDU headers)
- As a proper blk_mq driver, our PDU headers were preallocated
- PDU headers were allocated as normal Slab objects
- Can a Slab-originated allocation be sent to the network with zero-copy?
  - linux-mm seemed to agree we can (Discussion)...
- But every now and then, under some workloads, the kernel would panic:

kernel BUG at mm/usercopy.c:72!
CPU: 3 PID: 2335 Comm: dhclient Tainted: G O 4.12.10-1.el7.elrepo.x86_64 #1
...
Call Trace:
 copy_page_to_iter_iovec+0x9c/0x180
 copy_page_to_iter+0x22/0x160
 skb_copy_datagram_iter+0x157/0x260
 packet_recvmsg+0xcb/0x460
 sock_recvmsg+0x3d/0x50
 ___sys_recvmsg+0xd7/0x1f0
 __sys_recvmsg+0x51/0x90
 SyS_recvmsg+0x12/0x20
 entry_SYSCALL_64_fastpath+0x1a/0xa5

Slab, sendpage and kernel hardening
- Root cause:
  - At high queue depth, the TCP stack coalesces PDU headers into a single fragment
  - At the same time, userspace programs apply BPF packet filters (in this case dhclient)
  - Kernel hardening applies heuristics to catch exploits: in this case, panic if usercopy attempts to copy an skbuff containing a fragment that crosses a Slab object boundary
- Resolution:
  - Don't allocate PDU headers from the Slab allocators
  - Instead, use a queue-private page_frag_cache (see the sketch below)
  - This resolved the panic, and also improved page-referencing efficiency on the TX path!
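
A minimal sketch of the resolution, assuming a hypothetical example_queue structure; page_frag_alloc()/page_frag_free() are the kernel APIs involved, everything else is illustrative.

/* Illustrative only: example_* names are assumptions. PDU headers come
 * from a queue-private page_frag_cache instead of kmalloc/Slab, so a TX
 * fragment can never straddle a Slab object and trip the usercopy
 * hardening check. */
#include <linux/gfp.h>
#include <linux/mm.h>

struct example_queue {
	struct page_frag_cache pf_cache;	/* private to this NVMe/TCP queue */
};

static void *example_alloc_pdu_hdr(struct example_queue *queue,
				   unsigned int hdr_len)
{
	/* Sub-page allocation backed by a whole page owned by the cache;
	 * safe to hand to the zero-copy TX path. */
	return page_frag_alloc(&queue->pf_cache, hdr_len, GFP_KERNEL);
}

static void example_free_pdu_hdr(void *hdr)
{
	page_frag_free(hdr);
}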

Ecosystem
- Linux kernel support is upstream since v5.0 (both host and NVM subsystem)
  - https://lwn.net/Articles/772556/
  - https://patchwork.kernel.org/patch/10729733/
- SPDK support (both host and NVM subsystem)
  - https://github.com/spdk/spdk/releases
  - https://spdk.io/news/2018/11/15/nvme_tcp/
- NVMe™ compliance program
  - Interoperability testing started at UNH-IOL in the Fall of 2018
  - Formal NVMe compliance testing at UNH-IOL planned to start in the Fall of 2019
- For more information see: https://nvmexpress.org/welcome-nvme-tcp-to-the-nvme-of-family-of-transports/

Summary
- NVMe™/TCP is a new NVMe-oF™ transport
- NVMe/TCP is specified by TP 8000 (available at www.nvmexpress.org)
- Since TP 8000 is ratified, NVMe/TCP is officially part of NVMe-oF 1.0 and will be documented as part of the next NVMe-oF specification release
- NVMe/TCP offers a number of benefits:
  - Works with any fabric that supports TCP/IP
  - Does not require a "storage fabric" or any special hardware
  - Provides near direct-attached NAND SSD performance
  - A scalable solution that works within a data center or across the world

Storage Performance Development Kit
- User-space C libraries that implement a block stack
  - Includes an NVMe™ driver
  - Full-featured block stack
- Open source, 3-clause BSD
- Asynchronous, event-loop, polling design strategy
  - Very different from the traditional OS stack (but very similar to the new io_uring in Linux)
- 100% focus on performance (latency and bandwidth)
- https://spdk.io

NVMe-oF™ History

NVMe™ over Fabrics Target:
- July 2016: Initial release (RDMA transport)
- July 2016 – Oct 2018: Hardening, feature completeness
  - Performance improvements (scalability)
  - Design changes (introduction of poll groups)
- Jan 2019: TCP transport
  - Compatible with the Linux kernel
  - Based on POSIX sockets (option to swap in VPP)

NVMe over Fabrics Host:
- December 2016: Initial release (RDMA transport)
- July 2016 – Oct 2018: Hardening, feature completeness
  - Performance improvements (zero copy)
- Jan 2019: TCP transport
  - Compatible with the Linux kernel
  - Based on POSIX sockets (option to swap in VPP)

NVMe-oF™ Target Design Overview
- The target spawns one thread per core, each running an event loop
  - The event loop is called a "poll group"
- New connections (sockets) are assigned to a poll group when accepted
- A poll group polls the sockets it owns using epoll/kqueue for incoming requests
- A poll group polls dedicated NVMe™ queue pairs on the back end for completions (indirectly, via the block device layer)
- I/O processing is run-to-completion and entirely lock-free (see the sketch below)
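
The loop below is an illustrative poll-group sketch, not SPDK's actual code; handle_socket_ready() and poll_backend_completions() are assumed helpers standing in for the socket-processing and block-device polling paths.

/* Illustrative poll-group loop: one such loop runs pinned to each core;
 * each socket belongs to exactly one group, so request processing needs
 * no locks. */
#include <stdbool.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

struct poll_group {
	int epfd;		/* epoll instance owning this group's sockets */
	volatile bool running;
};

/* Hypothetical helpers that would live elsewhere in the target. */
void handle_socket_ready(void *connection);	/* read PDUs, submit I/O */
void poll_backend_completions(void);		/* reap NVMe completions */

void poll_group_run(struct poll_group *group)
{
	struct epoll_event events[MAX_EVENTS];

	while (group->running) {
		/* Non-blocking poll (timeout 0) keeps the loop busy-polling
		 * both the network side and the NVMe completion queues. */
		int n = epoll_wait(group->epfd, events, MAX_EVENTS, 0);

		for (int i = 0; i < n; i++)
			handle_socket_ready(events[i].data.ptr);

		/* Run-to-completion: completions are reaped on the same
		 * thread that accepted and parsed the request. */
		poll_backend_completions();
	}
}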

Transport Abstraction: Adding a New Transport
- Transports are abstracted away from the common NVMe-oF™ code via a plugin system
- Plugins are a set of function pointers that are registered as a new transport (see the sketch below)
- The TCP transport is implemented in lib/nvmf/tcp.c
- Socket operations are also abstracted behind a plugin system
  - POSIX sockets and VPP are supported
(Diagram: transport plugins — RDMA, TCP, FC?; socket plugins — POSIX, VPP)
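
The plugin pattern can be sketched as a table of function pointers like the one below; the struct layout, field names, and registration hook are illustrative assumptions, not SPDK's exact definitions (the real TCP transport lives in lib/nvmf/tcp.c).

/* Illustrative plugin table, not SPDK's exact API: a transport is a
 * named set of function pointers registered with the common NVMe-oF
 * target code. */
struct nvmf_request;
struct nvmf_listen_addr;

struct example_transport_ops {
	const char *name;				/* "RDMA", "TCP", ... */
	int   (*create)(void);				/* bring up the transport */
	int   (*listen)(const struct nvmf_listen_addr *addr);
	void *(*poll_group_create)(void);		/* per-core polling context */
	int   (*poll_group_poll)(void *poll_group);	/* advance I/O state machines */
	int   (*request_complete)(struct nvmf_request *req);
	void  (*destroy)(void);
};

/* Hypothetical registration hook called by each transport at init. */
int example_transport_register(const struct example_transport_ops *ops);

A TCP transport would then fill in such a table with its own callbacks and register it once at startup; the common target code never needs to know which transport sits underneath.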

Future Work
- Better socket syscall batching!
  - Calling epoll_wait, readv, and writev over and over isn't effective
  - Need to batch the syscalls for a given poll group: abuse libaio's io_submit? io_uring? (see the sketch below)
  - Can likely reduce the number of syscalls by a factor of 3 or 4
- Better integration with VPP (eliminate a copy)
- Integrate with TCP acceleration available in NICs
- NVMe-oF offload support
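
The batching idea can be sketched with liburing as shown below; everything other than the liburing calls (struct conn, batch_reads(), the buffer setup) is an assumption, not SPDK code.

/* Sketch of syscall batching: queue one readv per connection in the poll
 * group, then issue them all with a single io_uring_submit() syscall
 * instead of one readv() syscall per socket. */
#include <liburing.h>
#include <sys/uio.h>

#define QUEUE_DEPTH 256

struct conn {
	int fd;
	struct iovec iov;	/* points at this connection's receive buffer */
};

int batch_reads(struct io_uring *ring, struct conn *conns, unsigned nconns)
{
	for (unsigned i = 0; i < nconns; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		if (!sqe)
			break;			/* SQ ring full; submit what we have */
		io_uring_prep_readv(sqe, conns[i].fd, &conns[i].iov, 1, 0);
		io_uring_sqe_set_data(sqe, &conns[i]);
	}
	/* One syscall for the whole poll group. */
	return io_uring_submit(ring);
}

int example_setup(void)
{
	struct io_uring ring;

	if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0)
		return -1;
	/* ...register connections, call batch_reads() from the poll loop,
	 *    then reap completions with io_uring_peek_cqe()... */
	io_uring_queue_exit(&ring);
	return 0;
}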