

1 NVMe™/TCP Development Status and a Case Study of SPDK User Space Solution
2019 NVMe™ Annual Members Meeting and Developer Day, March 19, 2019
Sagi Grimberg, Lightbits Labs; Ben Walker and Ziye Yang, Intel

2 NVMe™/TCP Status
TP 8000 ratified in November 2018
- Linux kernel NVMe/TCP support was merged in v5.0
- Interoperability tested with vendors and SPDK
- Running in large-scale production environments (as backports, though)
Main TODOs:
- TLS support
- Connection termination rework
- I/O polling (leverage sk_busy_loop() for polling)
- Various performance optimizations (mainly in the host driver)
- A few minor specification wording issues to fix up

3 Performance: Interrupt Affinity
In NVMe™ we pay close attention to steering each interrupt to the application's CPU core.
In TCP networking:
- TX interrupts are usually steered to the submitting CPU core (XPS)
- RX interrupt steering is determined by a hash of the 5-tuple, which is not local to the application CPU core
But aRFS comes to the rescue:
- The RPS mechanism is offloaded to the NIC
- The NIC driver implements .ndo_rx_flow_steer
- The RPS stack learns which CPU core processes the stream and teaches the HW with a dedicated steering rule (see the sketch below)
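Below is a minimal, hedged sketch of what the aRFS hook looks like from a NIC driver's perspective: the stack calls .ndo_rx_flow_steer once it knows which CPU core (and hence RX queue) consumes a flow, and the driver programs a matching hardware steering rule. The example_* names and the example_hw_install_filter() helper are hypothetical; a real driver programs hardware-specific flow filters here.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#ifdef CONFIG_RFS_ACCEL
/* Hypothetical hardware helper: install a 5-tuple filter that steers
 * this flow's packets to RX queue rxq_index; returns a filter ID. */
static int example_hw_install_filter(struct net_device *dev,
                                     const struct sk_buff *skb,
                                     u16 rxq_index, u32 flow_id);

static int example_rx_flow_steer(struct net_device *dev,
                                 const struct sk_buff *skb,
                                 u16 rxq_index, u32 flow_id)
{
    /* Called by the RPS/aRFS stack when it learns which CPU core
     * processes this stream; teach the HW with a dedicated rule. */
    return example_hw_install_filter(dev, skb, rxq_index, flow_id);
}
#endif

static const struct net_device_ops example_netdev_ops = {
    /* ...usual .ndo_open / .ndo_start_xmit / ... callbacks... */
#ifdef CONFIG_RFS_ACCEL
    .ndo_rx_flow_steer = example_rx_flow_steer,
#endif
};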

4 Canonical Latency Overhead Comparison
The measurement tests the latency overhead of a QD=1 I/O operation. NVMe™/TCP is faster than iSCSI but slower than NVMe/RDMA.

5 Performance: Large Transfer Optimizations
NVMe™ usually imposes minor CPU overhead for large I/O:
- <= 8K (two pages): only two pointers to assign
- > 8K: set up a PRP/SGL
In TCP networking:
- Large TX transfers involve higher overhead for TCP segmentation and copies. Solution: TCP Segmentation Offload (TSO) and .sendpage() (see the sketch below)
- Large RX transfers involve higher overhead from more interrupts and copies. Solution: Generic Receive Offload (GRO) and adaptive interrupt moderation
Still more overhead than PCIe, though...
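As a rough illustration of the TX-side point, here is a hedged sketch of pushing page-backed payload with kernel_sendpage() so the stack takes page references instead of copying, and of using MSG_MORE/MSG_SENDPAGE_NOTLAST so TSO can coalesce large segments. This is not the actual nvme-tcp code; the function name is illustrative and short-send handling is omitted.

#include <linux/mm.h>
#include <linux/net.h>
#include <linux/socket.h>

static int example_send_data_pages(struct socket *sock, struct page **pages,
                                   int npages, size_t last_len)
{
    int i, ret;

    for (i = 0; i < npages; i++) {
        size_t len = (i == npages - 1) ? last_len : PAGE_SIZE;
        int flags = MSG_DONTWAIT;

        /* Tell the stack more data follows so it can coalesce into
         * large TSO segments instead of flushing per page. */
        if (i < npages - 1)
            flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

        ret = kernel_sendpage(sock, pages[i], 0, len, flags);
        if (ret < 0)
            return ret;    /* a real driver also handles short sends */
    }
    return 0;
}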

6 Throughput Comparison
Single-threaded NVMe™/TCP achieves 2x better throughput than iSCSI. NVMe/TCP scales to saturate 100Gb/s with 2-3 threads, whereas iSCSI remains blocked.

7 NVMe™/TCP Parallel Interface
Each NVMe queue maps to a dedicated bidirectional TCP connection:
- No controller-wide sequencing
- No controller-wide reassembly constraints

8 4K IOPs Scalability
iSCSI is heavily serialized and cannot scale with the number of threads. NVMe™/TCP scales very well, reaching over 2M 4K IOPs.

9 Performance: Read vs. Write I/O Queue Separation
A common problem with TCP/IP is head-of-queue (HOQ) blocking: for example, a small 4KB read is blocked behind a large 1MB write until its data transfer completes.
Linux supports separate queue maps since v5.0: a default queue map, a read queue map, and a poll queue map.
NVMe™/TCP leverages separate queue maps to eliminate HOQ blocking; in the future, priority-based queue arbitration can reduce the impact even further (see the sketch below).
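A hedged sketch of the queue-map idea, loosely modeled on the blk-mq .map_queues callback that v5.0-era drivers implement: reads, default (write) I/O, and polled I/O each get their own range of hardware queues, i.e. their own TCP connections. The example_* name and the queue-count variables are illustrative, not the actual driver code.

#include <linux/blk-mq.h>

/* Illustrative split of the available I/O queues (hypothetical counts). */
static unsigned int nr_default_queues = 4;
static unsigned int nr_read_queues = 4;
static unsigned int nr_poll_queues = 2;

static int example_map_queues(struct blk_mq_tag_set *set)
{
    struct blk_mq_queue_map *map;

    /* Default map: writes and anything not covered by another map. */
    map = &set->map[HCTX_TYPE_DEFAULT];
    map->nr_queues = nr_default_queues;
    map->queue_offset = 0;
    blk_mq_map_queues(map);

    /* Read map: reads travel on their own queues/connections, so a
     * large write can no longer block a small read. */
    map = &set->map[HCTX_TYPE_READ];
    map->nr_queues = nr_read_queues;
    map->queue_offset = nr_default_queues;
    blk_mq_map_queues(map);

    /* Poll map: queues reserved for interrupt-free polled I/O. */
    map = &set->map[HCTX_TYPE_POLL];
    map->nr_queues = nr_poll_queues;
    map->queue_offset = nr_default_queues + nr_read_queues;
    blk_mq_map_queues(map);

    return 0;
}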

10 Performance: Read vs. Write I/O Queue Separation
NVMe™/TCP leverages separate Queue Maps to eliminate HOQ Blocking. Future: Priority Based Queue Arbitration can reduce impact even further

11 Mixed Workloads Test
Test the impact of large write I/O on read latency:
- 32 “readers” issuing synchronous READ I/O
- 1 writer issuing 1MB writes at QD=16
Results:
- iSCSI latencies collapse in the presence of large writes (heavy serialization over a single channel)
- NVMe™/TCP is very much on par with NVMe/RDMA

12 Commercial Performance
Software NVMe™/TCP controller performance (IOPs vs. latency)*
* Commercial single 2U NVM subsystem that implements RAID and compression, with 8 attached hosts

13 Commercial Performance – Mixed Workloads
Software NVMe™/TCP controller performance (IOPs vs. latency)*
* Commercial single 2U NVM subsystem that implements RAID and compression, with 8 attached hosts

14 Slab, sendpage and kernel hardening
We never copy buffers on the NVMe™/TCP TX side (not even PDU headers).
As a proper blk_mq driver, our PDU headers were preallocated in advance, as normal slab objects.
Can a slab allocation be sent to the network with zero-copy? linux-mm seemed to agree we can (Discussion)...
But every now and then, under some workloads, the kernel would panic:

kernel BUG at mm/usercopy.c:72!
CPU: 3 PID: 2335 Comm: dhclient Tainted: G O el7.elrepo.x86_64 #1
...
Call Trace:
 copy_page_to_iter_iovec+0x9c/0x180
 copy_page_to_iter+0x22/0x160
 skb_copy_datagram_iter+0x157/0x260
 packet_recvmsg+0xcb/0x460
 sock_recvmsg+0x3d/0x50
 ___sys_recvmsg+0xd7/0x1f0
 __sys_recvmsg+0x51/0x90
 SyS_recvmsg+0x12/0x20
 entry_SYSCALL_64_fastpath+0x1a/0xa5

15 Slab, sendpage and kernel hardening
Root cause:
- At high queue depth, the TCP stack coalesces PDU headers into a single fragment
- At the same time, userspace programs apply BPF packet filters (in this case dhclient)
- Kernel hardening applies heuristics to catch exploits: in this case, panic if a usercopy attempts to copy an skbuff containing a fragment that crosses a slab object boundary
Resolution:
- Don't allocate PDU headers from the slab allocators; instead use a queue-private page_frag_cache (see the sketch below)
- This resolved the panic, and also improved page-referencing efficiency on the TX path!
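A hedged sketch of the fix described above: carve PDU headers out of a per-queue page_frag_cache (plain page-backed memory) instead of the slab, so the memory later referenced by skb fragments no longer sits inside slab objects. struct example_queue, example_alloc_pdu() and EXAMPLE_PDU_SIZE are illustrative names, not the actual driver types.

#include <linux/gfp.h>
#include <linux/mm_types.h>

#define EXAMPLE_PDU_SIZE 24    /* illustrative PDU header size */

struct example_queue {
    struct page_frag_cache pf_cache;    /* per-queue, starts zeroed */
};

static void *example_alloc_pdu(struct example_queue *queue)
{
    /* Page-backed allocation: safe to hand to the network stack with
     * zero-copy, and cheap to reference on the TX path. */
    return page_frag_alloc(&queue->pf_cache, EXAMPLE_PDU_SIZE, GFP_KERNEL);
}

static void example_free_pdu(void *pdu)
{
    /* Drops the page reference taken by page_frag_alloc(). */
    page_frag_free(pdu);
}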

16 Ecosystem
- Linux kernel support is upstream since v5.0 (both host and NVM subsystem)
- SPDK support (both host and NVM subsystem)
- NVMe™ compliance program: interoperability testing started at UNH-IOL in the fall of 2018; formal NVMe compliance testing at UNH-IOL is planned to start in the fall of 2019
For more information see:

17 Summary NVMe™/TCP is a new NVMe-oF™ transport
- NVMe/TCP is specified by TP 8000 (available at
- Since TP 8000 is ratified, NVMe/TCP is officially part of NVMe-oF 1.0 and will be documented as part of the next NVMe-oF specification release
NVMe/TCP offers a number of benefits:
- Works with any fabric that supports TCP/IP
- Does not require a “storage fabric” or any special hardware
- Provides near direct-attached NAND SSD performance
- A scalable solution that works within a data center or across the world

18 Storage Performance Development Kit
- User-space C libraries that implement a block stack, including an NVMe™ driver
- Full-featured block stack
- Open source, 3-clause BSD
- Asynchronous, event-loop, polling design strategy: very different from the traditional OS stack, but very similar to the new io_uring in Linux (see the sketch below)
- 100% focus on performance (latency and bandwidth)
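A minimal sketch of the asynchronous, polled model: submit an I/O with a completion callback, then reap completions by polling the queue pair rather than waiting for an interrupt. It assumes a controller and namespace already attached (e.g. via spdk_nvme_probe()) and the SPDK environment initialized; error handling is omitted.

#include <stdbool.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static volatile bool g_done;

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    g_done = true;    /* invoked from the polling call below */
}

static void read_one_block(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    struct spdk_nvme_qpair *qp = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
    void *buf = spdk_zmalloc(4096, 4096, NULL, SPDK_ENV_SOCKET_ID_ANY,
                             SPDK_MALLOC_DMA);

    /* Submission is asynchronous; nothing blocks here. */
    spdk_nvme_ns_cmd_read(ns, qp, buf, 0 /* LBA */, 1 /* blocks */,
                          read_done, NULL, 0);

    /* The event loop: poll for completions, no interrupts involved. */
    while (!g_done)
        spdk_nvme_qpair_process_completions(qp, 0);

    spdk_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qp);
}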

19 NVMe-oF™ History

NVMe™ over Fabrics Target
- July 2016: initial release (RDMA transport)
- July 2016 – Oct 2018: hardening, feature completeness, performance improvements (scalability), design changes (introduction of poll groups)
- Jan 2019: TCP transport; compatible with the Linux kernel; based on POSIX sockets (option to swap in VPP)

NVMe™ over Fabrics Host
- December 2016: initial release (RDMA transport)
- July 2016 – Oct 2018: hardening, feature completeness, performance improvements (zero copy)
- Jan 2019: TCP transport; compatible with the Linux kernel; based on POSIX sockets (option to swap in VPP)

20 NVMe-oF™ Target Design Overview
- The target spawns one thread per core, each running an event loop
- The event loop is called a “poll group”
- New connections (sockets) are assigned to a poll group when accepted
- Each poll group polls the sockets it owns, using epoll/kqueue, for incoming requests
- Each poll group polls dedicated NVMe™ queue pairs on the back end for completions (indirectly, via the block device layer)
- I/O processing is run-to-completion and entirely lock-free (see the sketch below)
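A simplified, illustrative poll-group loop (not the actual SPDK code): one such loop runs pinned to each core, and since each socket and NVMe queue pair belongs to exactly one poll group, no locking is needed. poll_connection() and poll_backend_completions() are hypothetical helpers.

#include <stdbool.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

struct poll_group {
    int epoll_fd;    /* watches the sockets this group owns */
    /* ... lists of connections and back-end NVMe queue pairs ... */
};

/* Hypothetical helpers: parse PDUs / reap back-end completions. */
void poll_connection(void *conn);
void poll_backend_completions(struct poll_group *pg);

static void poll_group_run(struct poll_group *pg, volatile bool *running)
{
    struct epoll_event events[MAX_EVENTS];

    while (*running) {
        /* Non-blocking poll of this group's accepted sockets. */
        int n = epoll_wait(pg->epoll_fd, events, MAX_EVENTS, 0);

        for (int i = 0; i < n; i++) {
            /* Run-to-completion: parse the PDU and submit the I/O to
             * the back end on this same thread, no hand-offs. */
            poll_connection(events[i].data.ptr);
        }

        /* Poll the group's NVMe queue pairs (via the block device
         * layer) for completions and send responses back. */
        poll_backend_completions(pg);
    }
}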

21 Transport Abstraction
Adding a new transport:
- Transports are abstracted away from the common NVMe-oF™ code via a plugin system
- Plugins are a set of function pointers that are registered as a new transport (see the sketch below)
- The TCP transport is implemented in lib/nvmf/tcp.c
- Transports shown: RDMA, TCP, FC?
Socket abstraction:
- Socket operations are also abstracted behind a plugin system
- POSIX sockets and VPP are supported
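To illustrate the plugin pattern, here is a hedged sketch of a transport expressed as a table of function pointers registered with the common layer. The field names, nvmf_transport_register(), and the tcp_* functions are hypothetical and do not reproduce the exact SPDK transport-ops layout.

#include <stdint.h>

struct nvmf_transport;      /* opaque transport instance */
struct nvmf_poll_group;     /* opaque per-core poll group */

struct nvmf_transport_ops {
    const char *name;
    struct nvmf_transport *(*create)(void);
    int  (*listen)(struct nvmf_transport *t, const char *addr, uint16_t port);
    int  (*accept)(struct nvmf_transport *t, struct nvmf_poll_group *pg);
    int  (*poll_group_poll)(struct nvmf_poll_group *pg);
    void (*destroy)(struct nvmf_transport *t);
};

/* Provided by the common NVMe-oF code (hypothetical registration hook). */
void nvmf_transport_register(const struct nvmf_transport_ops *ops);

/* Implemented by the TCP transport (declarations only, for illustration). */
struct nvmf_transport *tcp_create(void);
int  tcp_listen(struct nvmf_transport *t, const char *addr, uint16_t port);
int  tcp_accept(struct nvmf_transport *t, struct nvmf_poll_group *pg);
int  tcp_poll_group_poll(struct nvmf_poll_group *pg);
void tcp_destroy(struct nvmf_transport *t);

/* Registered at startup: nvmf_transport_register(&tcp_ops); */
static const struct nvmf_transport_ops tcp_ops = {
    .name            = "TCP",
    .create          = tcp_create,
    .listen          = tcp_listen,
    .accept          = tcp_accept,
    .poll_group_poll = tcp_poll_group_poll,
    .destroy         = tcp_destroy,
};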

22 Future Work Better socket syscall batching!
- Calling epoll_wait, readv, and writev over and over isn't efficient; the syscalls for a given poll group need to be batched. Abuse libaio's io_submit? io_uring? This can likely reduce the number of syscalls by a factor of 3 or 4 (see the sketch below)
- Better integration with VPP (eliminate a copy)
- Integrate with TCP acceleration available in NICs
- NVMe-oF™ offload support
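One hedged sketch of the batching idea using io_uring (one of the options mentioned above; not current SPDK code): queue a readv SQE for every ready socket, then submit the whole batch with a single io_uring_submit() call instead of one readv() syscall per socket.

#include <liburing.h>
#include <sys/uio.h>

#define QUEUE_DEPTH 64

/* One readv per socket, but a single syscall for the whole batch. */
static int batch_socket_reads(struct io_uring *ring, int *socks,
                              struct iovec *iovs, int nsocks)
{
    for (int i = 0; i < nsocks; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            break;    /* SQ ring full: submit what we have */
        io_uring_prep_readv(sqe, socks[i], &iovs[i], 1, 0);
        io_uring_sqe_set_data(sqe, &socks[i]);
    }
    return io_uring_submit(ring);    /* the one syscall */
}

int main(void)
{
    struct io_uring ring;

    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    /* A poll group would call batch_socket_reads() from its event loop
     * and reap results with io_uring_peek_cqe()/io_uring_cqe_seen(). */
    io_uring_queue_exit(&ring);
    return 0;
}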

23

