Tushar Gohad (Intel), Moshe Levi (Mellanox), Ivan Kolodyazhny (Mirantis)




Presentation transcript:

Cinder and NVMe-over-Fabrics: Network-Connected SSDs with Local Performance
Tushar Gohad (Intel), Moshe Levi (Mellanox), Ivan Kolodyazhny (Mirantis)

Storage Evolution. Technology claims are based on comparisons of latency, density and write-cycling metrics among memory technologies, recorded on published specifications of in-market memory products against internal Intel specifications.

Intel 3D XPoint* Performance at QD=1

NVM Express (NVMe). Standardized interface for non-volatile memory, http://nvmexpress.org
NVM Express is an interface specification optimized for PCI Express®-based storage solutions, such as solid-state drives. The NVMe 1.0 specification defines a scalable architecture that unlocks the potential of PCIe-based SSDs. With a 6x throughput improvement and reduced storage latency over 6 Gbps SATA SSDs, the Intel SSD PCIe family increases processor utilization while scaling to meet demand. NVM Express revolutionizes storage by delivering faster access to data while lowering latency and power consumption. NVM Express reduces latency by eliminating the delay associated with memory adapters and by optimizing the storage protocol that has limited SSD performance until now. NVMe reduces the number of layers in the storage protocol stack: storage commands are processed with 60% fewer processor cycles and 60% less latency. These freed-up processing cycles can now be used for real application workloads, improving the efficiency of the processor. NVMe delivers higher input/output operations per second (IOPS) and reduces power consumption for a lower total cost of ownership. Segue: Intel sets itself apart from others by delivering the PCIe family optimized for today's mixed-workload applications.
Source: Intel. Other names and brands are property of their respective owners. Technology claims are based on comparisons of latency, density and write-cycling metrics among memory technologies, recorded on published specifications of in-market memory products against internal Intel specifications.

NVMe: Best-in-Class IOPS, Lower/Consistent Latency. 3x better IOPS vs. 12 Gbps SAS; for the same number of CPU cycles, NVMe delivers over 2x the IOPS of SAS. Lowest latency of standard storage interfaces: Gen1 NVMe has 2 to 3x better latency consistency vs. SAS.
Test and system configurations: PCI Express* (PCIe*)/NVM Express* (NVMe) measurements made on an Intel® Core™ i7-3770S system @ 3.1 GHz with 4 GB memory running Windows* Server 2012 Standard, Intel PCIe/NVMe SSDs, data collected by the IOmeter* tool. SAS measurements from HGST Ultrastar* SSD800M/1000M (SAS), SATA S3700 Series. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Source: Intel internal testing.

Remote Access to Storage – iSCSI and NVMe-oF Target. [Diagram: disaggregated cloud deployment model – SCSI and NVMe-oF targets exposing NVMe devices through a Block Device Abstraction (BDEV) over the network.]
NVMe-over-Fabrics: NVMe commands carried over a storage networking fabric. NVMe-oF supports various fabric transports: RDMA (RoCE, iWARP), InfiniBand™, Fibre Channel, Intel® Omni-Path Architecture, and future fabrics.

NVMe and NVMe-oF Basics. Namespaces: a mapping of NVM media to a formatted LBA range. Subsystem ports are associated with physical fabric ports. Multiple NVMe controllers may be accessed through a single port; each NVMe controller is associated with one port. Fabric types: PCIe, RDMA (Ethernet RoCE/iWARP, InfiniBand™), Fibre Channel/FCoE.
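These objects can be inspected from a Linux host with the nvme-cli tool. A minimal sketch, assuming nvme-cli is installed; the device names are only examples:

  nvme list                  # namespaces exposed as block devices (/dev/nvme0n1, ...)
  nvme list-subsys           # subsystem NQNs and the controllers attached to them
  nvme id-ctrl /dev/nvme0    # controller identify data (model, firmware, capabilities)
  nvme id-ns /dev/nvme0n1    # namespace identify data (LBA format, capacity)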

NVMe Subsystem Implementations including NVMe-oF

NVMe-oF: Local NVMe Performance. The idea is to extend the efficiency of the local NVMe interface over a network fabric (Ethernet or InfiniBand). NVMe commands and data structures are transferred end to end. Relies on RDMA for performance, bypassing TCP/IP. For more information on NVMe over Fabrics (NVMe-oF): http://www.nvmexpress.org/wp-content/uploads/NVMe_Over_Fabrics.pdf
How does NVMe-oF maintain NVMe performance? The NVMe interface is so efficient and lightweight that it moves the bottleneck from the disk to the network. To keep the same performance as the local NVMe interface we need a fast, low-latency protocol, and that is where RDMA comes to the rescue: RDMA over Converged Ethernet (RoCE) and RDMA over InfiniBand.

What Is RDMA? Remote Direct Memory Access (RDMA) is an advanced transport protocol (same layer as TCP and UDP). Main features: remote memory read/write semantics in addition to send/receive; kernel bypass / direct user-space access; full hardware offload; secure, channel-based I/O. Application advantages: low latency, high bandwidth, low CPU consumption. Transports: RoCE, iWARP. Verbs: the RDMA software interface (equivalent to sockets).
RDMA is the remote version of DMA (Direct Memory Access). DMA lets you read/write memory without utilizing the CPU; RDMA is the same, only with a remote server, so you can access memory on server B from server A without utilizing the CPU (it requires a lossless fabric). RDMA bypasses the whole kernel TCP/IP stack, and the transport layer is handled in the RDMA NIC itself.
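A quick way to confirm that RDMA-capable hardware is present on a host is with the rdma-core utilities. A minimal sketch, assuming the libibverbs and iproute2 tools are installed; the device name mlx5_0 is only an example:

  ibv_devices              # list RDMA devices visible to user space
  ibv_devinfo -d mlx5_0    # port state, link layer (InfiniBand or Ethernet/RoCE), GIDs
  rdma link show           # iproute2 view of RDMA links and their netdev bindings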

RDMA and NVMe: A Perfect Match. [Diagram: NVMe queues on one side of the network, RDMA queues on the other.]
NVMe and RDMA are both asynchronous, queue-based protocols. As you can see in the diagram, RDMA has Send Queues, Receive Queues and Completion Queues, while NVMe has Admin Submission and Completion Queues for controller management and I/O Submission and Completion Queues for I/O operations.

Mellanox Product Portfolio: Ethernet & InfiniBand RDMA. End-to-end 25, 40, 50, 56, 100 Gb: NICs, cables, switches.

NVMe-oF – Kernel Initiator. The kernel initiator side uses the nvme-cli package. Connect to a remote target: nvme connect -t rdma -n <conn_nqn> -a <target_ip> -s <target_port>. nvme list – lists all NVMe devices.
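A minimal end-to-end sketch of the initiator-side flow with nvme-cli; the NQN, address and port below are placeholders, reusing the values from the configuration examples later in this deck:

  modprobe nvme-rdma                            # RDMA transport for the NVMe host driver
  nvme discover -t rdma -a 1.1.1.1 -s 4420      # list subsystems exported by the target
  nvme connect -t rdma -n nvme-subsystem-1 -a 1.1.1.1 -s 4420
  nvme list                                     # the remote namespace shows up as /dev/nvmeXnY
  nvme disconnect -n nvme-subsystem-1           # detach when done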

NVMe-oF – Kernel Target. The kernel target side uses the nvmetcli package. nvmetcli save <file_name> – create a new subsystem configuration; nvmetcli restore – load existing subsystems.
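nvmetcli is a thin front end to the kernel nvmet configfs interface, so the same export can be built by hand, which makes the moving parts explicit. A sketch under the assumption that /dev/nvme0n1 is the backing device, reusing the address, port id and namespace id from the Cinder example later in this deck:

  modprobe nvmet
  modprobe nvmet-rdma
  cd /sys/kernel/config/nvmet
  mkdir subsystems/nvme-subsystem-1
  echo 1 > subsystems/nvme-subsystem-1/attr_allow_any_host
  mkdir subsystems/nvme-subsystem-1/namespaces/10
  echo /dev/nvme0n1 > subsystems/nvme-subsystem-1/namespaces/10/device_path
  echo 1 > subsystems/nvme-subsystem-1/namespaces/10/enable
  mkdir ports/2
  echo 1.1.1.1 > ports/2/addr_traddr
  echo rdma    > ports/2/addr_trtype
  echo 4420    > ports/2/addr_trsvcid
  echo ipv4    > ports/2/addr_adrfam
  ln -s /sys/kernel/config/nvmet/subsystems/nvme-subsystem-1 ports/2/subsystems/nvme-subsystem-1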

NVMe-oF in OpenStack. Available from the Rocky release (we hope ☺). Available with TripleO deployment. Requires RDMA NICs. Supports the kernel target and the kernel initiator; the SPDK target is work in progress. Work credit: Ivan Kolodyazhny (Mirantis) – first PoC with SPDK; Maciej Szwed (Intel) – SPDK target; Hamdy Khadr and Moshe Levi (Mellanox) – kernel initiator and target.

NVMe-oF in OpenStack

NVMe-oF in OpenStack. First implementation of NVMe-over-Fabrics in OpenStack. Target OpenStack release: Rocky. [Architecture diagram: Cinder with the new NVMe-oF target driver and the kernel LVM volume driver provisions volumes from LVM (Logical Volume Manager) and exports them through the nvmet NVMe-oF target; on the Nova side, the NVMe-oF initiator attaches the volume, which KVM presents to the tenant VM as /dev/vda, with a Horizon client on top. The NVMe-oF data path runs over an RDMA-capable network, alongside the Nova/Cinder control path.]

NVMe-oF – Backend (cinder.conf)
[nvme-backend]
lvm_type = default
volume_group = vg_nvme
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
volume_backend_name = nvme-backend
target_helper = nvmet
target_protocol = nvmet_rdma
target_ip_address = 1.1.1.1
target_port = 4420
nvmet_port_id = 2
nvmet_ns_id = 10
target_prefix = nvme-subsystem-1
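With the backend listed under enabled_backends in cinder.conf, volumes are steered to it through a volume type. A usage sketch with placeholder names; only the volume_backend_name property has to match the configuration above:

  openstack volume type create nvme
  openstack volume type set nvme --property volume_backend_name=nvme-backend
  openstack volume create --type nvme --size 10 vol-nvme-01
  openstack server add volume my-instance vol-nvme-01   # Nova attaches it to the guest over NVMe-oF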

NVMe-oF with TripleO
# cat /home/stack/tripleo-heat-templates/environments/cinder-nvmeof-config.yaml
parameter_defaults:
  CinderNVMeOFBackendName: 'tripleo_nvmeof'
  CinderNVMeOFTargetPort: 4420
  CinderNVMeOFTargetHelper: 'nvmet'
  CinderNVMeOFTargetProtocol: 'nvmet_rdma'
  CinderNVMeOFTargetPrefix: 'nvme-subsystem'
  CinderNVMeOFTargetPortId: 1
  CinderNVMeOFTargetNameSpaceId: 10
  ControllerParameters:
    ExtraKernelModules:
      nvmet: {}
      nvmet-rdma: {}
  ComputeParameters:
    ExtraKernelModules:
      nvme: {}
      nvme-rdma: {}
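The environment file is then included in the overcloud deployment alongside the rest of the environment files. A sketch; paths are examples and any other -e arguments from the existing deployment are omitted:

  openstack overcloud deploy --templates \
      -e /home/stack/tripleo-heat-templates/environments/cinder-nvmeof-config.yaml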

NVMe-oF and SPDK Storage Performance Development Kit

Storage Performance Development Kit. Scalable and efficient software ingredients: user-space, lockless, polled-mode components; up to millions of IOPS per core; designed to extract maximum performance from non-volatile media. Storage reference architecture: optimized for latest-generation CPUs and SSDs; open-source, composable building blocks (BSD licensed); available via spdk.io.

Benefits of using SPDK

SPDK Architecture. [Block diagram: Storage Protocols (NVMe-oF* Target, iSCSI Target, vhost-scsi Target, vhost-blk Target, Linux nbd); Storage Services (Block Device Abstraction (bdev), Logical Volumes, Blobstore, BlobFS, QoS, GPT, Encryption); block device back ends (NVMe, Linux AIO, Ceph RBD, PMDK blk, virtio-scsi, virtio-blk); Core Drivers (NVMe* PCIe Driver, NVMe-oF* Initiator, Intel® QuickData Technology Driver, NVMe devices); an Application Framework; and integrations / 3rd-party components (Cinder, RocksDB, Ceph, QEMU, DPDK, VPP TCP/IP, RDMA, SCSI).]

NVMe-oF Performance with SPDK. NVMe* over Fabrics target features and the benefit each realizes: utilizes the NVM Express* (NVMe) polled-mode driver – reduced overhead per NVMe I/O; RDMA queue pair polling – no interrupt overhead; connections pinned to CPU cores – no synchronization overhead.
How to read the chart: the bars are the IOPS, the same for both SPDK and the kernel; the markers and line are the number of CPU cores required – 30 for the kernel vs. 3 for SPDK. This demonstrates the scalability of the SPDK NVMe-oF target: 50 Gbps per core, up to the limit of the network or disk. It is built on top of the SPDK NVMe driver, which has a software overhead of about 277 ns per I/O. RDMA queue polling eliminates interrupt software latency and improves consistency. Pinning the I/O workload to cores eliminates synchronization and gives "core granularity" of resource allocation. SPDK reduces NVMe over Fabrics software overhead by up to 10x!
System configuration: Target system: Supermicro SYS-2028U-TN24R4T+, 2x Intel® Xeon® E5-2699v4 (HT off), Intel® SpeedStep enabled, Intel® Turbo Boost Technology enabled, 8x 8 GB DDR4 2133 MT/s, 1 DIMM per channel, 12x Intel® P3700 NVMe SSD (800 GB) per socket, -1H0 FW; Network: Mellanox* ConnectX-4 LX 2x25Gb RDMA, direct connection between initiators and target; Initiator OS: CentOS* Linux* 7.2, Linux kernel 4.10.0; Target OS (SPDK): Fedora 25, Linux kernel 4.9.11; Target OS (Linux kernel): Fedora 25, Linux kernel 4.9.11. Performance as measured by fio, 4 KB random read I/O, 2 RDMA QPs per remote SSD, numjobs=4 per SSD, queue depth 32/job. SPDK commit ID: 4163626c5c.
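For reference, the footnote's workload corresponds roughly to an fio invocation like the one below, run on the initiator against one of the remote namespaces; the device path and runtime are only examples:

  fio --name=randread --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=32 --numjobs=4 --group_reporting --time_based --runtime=60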

SPDK LVOL Backend for OpenStack Cinder. First implementation of NVMe-over-Fabrics in OpenStack with an SPDK NVMe-oF target driver and an SPDK LVOL-based SDS storage backend (volume driver). Provides a high-performance alternative to kernel LVM and the kernel NVMe-oF target. Upstream Cinder PR# 564229. Target OpenStack release: Rocky. Joint work by Intel, Mirantis and Mellanox.
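Outside of Cinder, the same building blocks can be driven by hand through SPDK's RPC interface, which is roughly what the volume driver automates. A sketch only: the RPC names follow recent SPDK releases and may differ in older ones, and the binary path, PCIe address, NQN and IP are placeholders:

  ./build/bin/nvmf_tgt &                                # start the SPDK NVMe-oF target (path varies by release)
  scripts/rpc.py nvmf_create_transport -t RDMA
  scripts/rpc.py bdev_nvme_attach_controller -b Nvme0 -t pcie -a 0000:04:00.0
  scripts/rpc.py bdev_lvol_create_lvstore Nvme0n1 lvs0  # logical volume store on the NVMe bdev
  scripts/rpc.py bdev_lvol_create -l lvs0 vol1 10240    # 10 GiB logical volume (size in MiB)
  scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a
  scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 lvs0/vol1
  scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t rdma -a 1.1.1.1 -s 4420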

Demonstration: Upcoming Rocky NVMe-oF Feature