Persistent Memory over Fabrics: An Application-centric View


Persistent Memory over Fabrics: An Application-centric View
Paul Grun, Cray, Inc.
OpenFabrics Alliance Vice Chair

Agenda
- OpenFabrics Alliance intro
- OpenFabrics Software
- Introducing OFI: the OpenFabrics Interfaces project
- OFI framework overview: framework, providers
- Delving into Data Storage / Data Access: three use cases
- A look at Persistent Memory

OpenFabrics Alliance
The OpenFabrics Alliance (OFA) is an open source-based organization that develops, tests, licenses, supports and distributes OpenFabrics Software (OFS). The Alliance's mission is to develop and promote software that enables maximum application efficiency by delivering wire-speed messaging, ultra-low latencies and maximum bandwidth directly to applications with minimal CPU overhead.
https://openfabrics.org/index.php/organization.html

OpenFabrics Alliance: selected statistics
Founded in 2004.
Leadership:
- Susan Coulter, LANL: Chair
- Paul Grun, Cray Inc.: Vice Chair
- Bill Lee, Mellanox Inc.: Treasurer
- Chris Beggio, Sandia: Secretary (acting)
- 14 active Directors/Promoters (Intel, IBM, HPE, NetApp, Oracle, Unisys, Nat'l Labs...)
Major activities:
- Develop and support open source network stacks for high-performance networking: OpenFabrics Software (OFS)
- Interoperability program (in concert with the University of New Hampshire InterOperability Lab)
- Annual Workshop: March 27-31, Austin, TX
- Technical Working Groups: OFIWG and DS/DA (today's focus), EWG, OFVWG, IWG
https://openfabrics.org/index.php/working-groups.html (archives and presentations, all publicly available)

OpenFabrics Software (OFS)
Open source APIs and software for advanced networks. OFS emerged along with the nascent InfiniBand industry in 2004, and soon thereafter expanded to include other verbs-based networks such as RoCE and iWARP. It has been wildly successful, to the point that RDMA technology is now being integrated upstream. Clearly, people like RDMA.
- OFED distribution and support: managed by the Enterprise Working Group (EWG)
- Verbs development: managed by the OpenFabrics Verbs Working Group (OFVWG)
[Stack diagram: the verbs API (uverbs in user mode, kverbs in the kernel) running over InfiniBand, iWARP, and RoCE, on IB and Ethernet physical wires]

Historically, network APIs have been developed ad hoc as part of the development of a new network. To wit: today's verbs API is the implementation of the verbs semantics specified in the InfiniBand Architecture. But what if a network API were developed that catered specifically to the needs of its consumers, and the network beneath it were allowed to develop organically? What would the consumer requirements be? What would the resulting API look like?

Introducing the OpenFabrics Interfaces Project
The OpenFabrics Interfaces (OFI) project was proposed jointly by Cray and Intel in August 2013 and chartered by the OpenFabrics Alliance, with Cray and Intel as co-chairs.
Objectives: 'transport neutral' network APIs that are 'application centric', driven by application requirements. "Transport Neutral, Application-Centric"
OFI charter: develop, test, and distribute:
- An extensible, open source framework that provides access to high-performance fabric interfaces and services
- Extensible, open source interfaces aligned with ULP and application needs for high-performance fabric services
OFIWG will not create specifications, but will work with standards bodies to create interoperability as needed.

OpenFabrics Software (OFS)
Result: an extensible interface driven by application requirements, with support for multiple fabrics, exposing an interface written in the language of the application.
[Stack diagram: alongside verbs (uverbs/kverbs), the ofi API (libfabric in user mode, kfabric in the kernel) sits over a set of providers, which in turn drive IB, iWARP, RoCE, and other fabrics over IB and Ethernet wires]

OFI Framework
OFI consists of two major components:
- a set of defined APIs (and some functions): libfabric, kfabric
- a set of wire-specific 'providers': the 'OFI providers'
Think of a provider as an implementation of the API on a given wire.
[Diagram: an application or MPI sits on a series of interfaces to access fabric services; each provider (A, B, C) exports the services defined by the API and drives its own NIC]
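To make the framework concrete, here is a minimal user-mode sketch of the libfabric calls it defines: discover a provider matching the requested capabilities, then open the fabric, domain, and endpoint objects. This is a sketch rather than code from the talk; error handling is abbreviated, and the capability and endpoint-type choices are illustrative.

```c
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

int main(void)
{
    struct fi_info *hints, *info;
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_ep *ep;

    hints = fi_allocinfo();
    hints->caps = FI_MSG | FI_RMA;      /* ask for the services we need */
    hints->ep_attr->type = FI_EP_RDM;   /* reliable datagram endpoint */

    /* Discover any provider (verbs, sockets, gni, ...) that can
     * satisfy the hints; the fabric is chosen here, not hard-coded. */
    if (fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info)) {
        fprintf(stderr, "no matching provider\n");
        return 1;
    }

    /* Open the fabric, a domain (local NIC resources), and an endpoint. */
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    printf("selected provider: %s\n", info->fabric_attr->prov_name);

    fi_close(&ep->fid);
    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```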

OpenFabrics Interfaces: Providers
All providers expose the same interfaces; none of them exposes details of the underlying fabric.
[Diagram: under libfabric/kfabric sit, for example, the sockets provider (Ethernet), the GNI provider (Aries), the usnic provider (Ethernet), and the verbs provider (IB, plus RoCEv2 and iWARP over Ethernet)]
Current providers: sockets, verbs, usnic, gni, mlx, psm, psm2, udp, bgq
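As a hedged illustration of how a consumer reaches a particular provider: fi_getinfo() can be steered toward a provider by name through the hints structure, or the name can be left unset to enumerate everything that fits. The helper function here is ours, not part of the API; the name strings are the ones listed on the slide.

```c
#include <string.h>
#include <rdma/fabric.h>

/* Ask libfabric for one provider by name ("verbs", "sockets", "gni",
 * ...). With prov_name left NULL, fi_getinfo() instead returns every
 * provider that can satisfy the hints. */
static struct fi_info *find_provider(const char *name)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;

    hints->fabric_attr->prov_name = strdup(name);
    if (fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info))
        info = NULL;                 /* no such provider available */
    fi_freeinfo(hints);
    return info;
}
```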

OFI Framework – a bit more detail

OFI Project Overview: work group structure
Application-centric design means that the working groups are driven by use cases: Data Storage / Data Access, Distributed and Parallel Computing, and so on.
- Legacy apps: sockets apps, IP apps
- Data analysis: structured data, unstructured data
- Data Storage / Data Access (DS/DA WG): Data Storage (object, file, block; storage class memory); Data Access (remote persistent memory)
- Distributed and parallel computing (OFI WG): message passing (MPI applications); shared memory (PGAS languages)

OFI Project Overview: work group structure
The same taxonomy, split by execution mode: Data Storage / Data Access maps to kernel mode ('kfabric'), while the remaining use cases map to user mode ('libfabric'). (Not quite right, because you can imagine user-mode storage, e.g. Ceph.)

OFI Project Overview: work group structure
The same taxonomy once more. Since the topic today is Persistent Memory, the focus from here on is the Data Storage / Data Access column.

Data Storage, Data Access?
- DS: object, file, block; storage class memory
- DA: persistent memory
[Diagram: a user- or kernel-mode storage client reaches local DIMMs and NVDIMMs through a POSIX interface plus load/store and memcopy, via the file system and a virtual switch; it reaches NVM devices (SSDs...) and remote storage class memory / persistent memory through a provider and NIC]
Reminder: libfabric is the user-mode library; kfabric is the kernel modules.
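The two access styles on the left of that picture can be sketched side by side; the file paths below are illustrative only. Explicit POSIX I/O pays a system call per access, while mapped persistent memory is reached with plain loads and stores:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    char buf[64];

    /* Storage path: explicit I/O through the POSIX interface. */
    int fd = open("/data/file", O_RDONLY);       /* illustrative path */
    if (fd >= 0) {
        if (pread(fd, buf, sizeof(buf), 0) < 0)
            buf[0] = '\0';
        close(fd);
    }

    /* Memory path: map persistent memory, then access it with
     * ordinary loads and stores; no system call per access. */
    int pfd = open("/mnt/pmem/file", O_RDONLY);  /* illustrative path */
    if (pfd >= 0) {
        const char *pm = mmap(NULL, 4096, PROT_READ, MAP_SHARED, pfd, 0);
        if (pm != MAP_FAILED) {
            memcpy(buf, pm, sizeof(buf));        /* plain loads */
            munmap((void *)pm, 4096);
        }
        close(pfd);
    }
    return 0;
}
```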

DS/DA's Charter
Tolerance for latency spans a spectrum: local memory (ps), remote persistent memory (µs), remote storage (ms). Data Access sits at the low-latency end, Data Storage at the other.
- At what point does remote storage begin to look instead like remote persistent memory?
- How would applications treat a pool of remote persistent memory?
Key use cases: Lustre, NVMe, Persistent Memory

Data Storage Example: LNET
Lustre's network layer (LNET) requires a Lustre Network Driver (LND) for every fabric: gnilnd for Cray's Gemini/Aries, socklnd over sockets for TCP/IP, o2iblnd over verbs for IB and Ethernet. A kfabric LND would slot into the same stack.
[Diagram: the LNET app (Lustre, DVS...) sits on ptlrpc, then LNET, then an LND (gnilnd, socklnd, o2iblnd, or kfabric), then the network API and hardware (Gemini, Ethernet, IB)]
Backport LNET to kfabric, and we're done: any kfabric provider after that will work with Lustre.

Data Storage Example: NVMe
[Diagram: an NVMe stack running over kfabric with an ofi verbs provider, reaching IB and Ethernet (iWARP, RoCE) fabrics alongside the existing kverbs path]

Data Storage: enhanced APIs for storage
[Diagram: a kernel application drives VFS / block I/O / network FS / LNET over SCSI, iSCSI, NVMe, and NVMe/F, which in turn run over sockets, kverbs, or a kfabric provider onto NICs, HCAs, RNICs, or fabric-specific devices, across IP, IB, and RoCE/iWARP fabrics]
Places where the OFA can help:
- kfabric as a second native API for NVMe
- kfabric as a possible 'LND' for LNET

A look at Persistent Memory
Recall the latency spectrum: local memory (ps), remote persistent memory (µs), remote storage (ms).
- Applications tolerate long delays for storage, but assume very low latency for memory
- Storage systems are generally asynchronous, target driven, and optimized for cost
- Memory systems are synchronous, and highly optimized to deliver the lowest imaginable latency with no CPU stalls
- Persistent memory over fabrics is somewhere in between: much faster than storage, but not as fast as local memory
How to treat PM over fabrics? Either build tremendously fast remote memory networks, or find ways to hide the latency, making it look for the most part like memory, but cheaper and remote.

Two interesting PM use cases
Consumers that handle remote memory naturally (SHMEM-based applications, PGAS languages...). Requirements:
- optimized completion semantics to indicate that data is globally visible
- semantics to commit data to persistent memory
- completion semantics to indicate persistence
...and consumers that don't, but we wish they did: the HA use case. Requirements: an exceedingly fast remote memory bus, or a way to hide the latency; the latter requires an ability to control when data is written to PM.

PM high availability use case
[Diagram: a user application memory-maps a file through the file system; file access, cache writes, and flushes flow through load/store and the memory controller to DIMMs and NVDIMMs, while libfabric/kfabric providers replicate the stores ('store, store...') through a NIC to remote persistent memory]
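The local half of this picture is ordinary memory-mapped persistence. Below is a minimal sketch, assuming a PM-backed (DAX) file at an illustrative path; a PM-aware application would typically replace msync() with user-space cache flush instructions, and the remote replication shown on the slide happens beneath this level.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Map a file that, on a DAX-capable file system, is backed
     * directly by NVDIMM pages. The path is illustrative. */
    int fd = open("/mnt/pmem/log", O_RDWR);
    if (fd < 0)
        return 1;

    char *pm = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    if (pm == MAP_FAILED)
        return 1;

    /* store, store... the writes land in CPU cache first. */
    memcpy(pm, "record", 6);

    /* Flush, making the stores persistent. msync() is the portable
     * stand-in here; PM-aware code flushes cache lines directly. */
    msync(pm, 4096, MS_SYNC);

    munmap(pm, 4096);
    close(fd);
    return 0;
}
```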

Data Access: completion semantics
[Diagram: a PM client's user app reaches PM devices on a PM server across the fabric, through a provider, NICs, an I/O bus, and a memory bus]
For 'normal' fabrics, the responder returns an ACK when the data has been received by the endpoint. At that point the data may not be globally visible, and/or it may not yet be persistent. We need an efficient mechanism for indicating when the data is globally visible, and when it is persistent.

Data Access: key fabric requirements
Objectives:
- the client controls commits to persistent memory
- distinct indications of global visibility and of persistence
- optimized protocols that avoid round trips
[Diagram: the requester (consumer) issues fi_send / fi_rma through its endpoint and I/O pipeline to the responder's app memory and persistent memory; three possible completions: ack-a when the data is received, ack-b when it is visible, ack-c when it is persistent]
Caution: the OFA does not define wire protocols; it only defines the semantics seen by the consumer.
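libfabric later adopted extensions along exactly these lines. As a hedged sketch (not code from the talk): an RDMA write whose completion is deferred until the data is persistent at the target, the ack-c case above, can be requested with the FI_COMMIT_COMPLETE completion level, assuming a provider that advertises the FI_RMA_PMEM capability and with the endpoint, memory registration, and remote key already set up.

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* Issue one RDMA write whose completion means "persistent at the
 * target" (ack-c), not merely "received" (ack-a). FI_RMA_PMEM and
 * FI_COMMIT_COMPLETE are libfabric PMEM extensions adopted after
 * this talk; all handles are assumed to be initialized elsewhere. */
static int write_persistent(struct fid_ep *ep, void *buf, size_t len,
                            void *desc, fi_addr_t dest,
                            uint64_t raddr, uint64_t rkey)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct fi_rma_iov rma = { .addr = raddr, .len = len, .key = rkey };
    struct fi_msg_rma msg = {
        .msg_iov = &iov, .desc = &desc, .iov_count = 1,
        .addr = dest, .rma_iov = &rma, .rma_iov_count = 1,
    };

    /* One operation, one commit indication: no extra round trip
     * visible to the consumer. */
    return fi_writemsg(ep, &msg, FI_COMMIT_COMPLETE);
}
```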

Possible converged I/O stack
[Diagram: on the local side, a kernel application and a user app use byte access through the VFS / block layer to NVDIMMs and SSDs over the memory bus and PCIe; on the remote side, kernel applications drive VFS / block I/O / network FS / LNET and ULPs (SRP, iSER, NVMe/F, NFSoRDMA, SMB Direct, LNET/LND,...), plus iSCSI, SCSI, and NVMe, over libfabric/kfabric providers, sockets, and kverbs*, onto NICs, RNICs, HCAs, and fabric-specific devices; together these cover local I/O, local byte-addressable, remote byte-addressable, and remote I/O]
* a kfabric verbs provider exists today

Discussion: "Between Two Ferns"
Doug Voigt, SNIA NVM PM TWG Chair; Paul Grun, OFA Vice Chair, co-chair of OFIWG and DS/DA.
Are we going in the right direction? Next steps?

Thank You

Backup

kfabric API (very similar to the existing libfabric)
Consumer APIs: kfi_getinfo(), kfi_fabric(), kfi_domain(), kfi_endpoint(), kfi_cq_open(), kfi_ep_bind(), kfi_listen(), kfi_accept(), kfi_connect(), kfi_send(), kfi_recv(), kfi_read(), kfi_write(), kfi_cq_read(), kfi_cq_sread(), kfi_eq_read(), kfi_eq_sread(), kfi_close(), ...
Provider APIs:
- kfi_provider_register(): during kfi provider module load, a call to kfi_provider_register() supplies the kfi API with a dispatch vector for the kfi_* calls.
- kfi_provider_deregister(): during kfi provider module unload/cleanup, kfi_provider_deregister() destroys the kfi_* runtime linkage for the specific provider (reference counted).
kfabric architecture: http://downloads.openfabrics.org/WorkGroups/ofiwg/dsda_kfabric_architecture/
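A sketch of what provider registration might look like inside a kfabric provider module, based only on the description above: the header name, the dispatch-vector struct layout, and both function signatures are assumptions for illustration, not the actual KFI definitions.

```c
#include <linux/module.h>
#include <kfi/kfabric.h>   /* header name is an assumption */

/* Hypothetical dispatch vector: per the slide, kfi_provider_register()
 * supplies the KFI core with the provider's kfi_* entry points. */
static struct kfi_provider example_prov = {
    .name = "example",
    /* .getinfo = ..., .fabric = ..., etc. (assumed members) */
};

static int __init example_prov_init(void)
{
    /* Register at module load; signature assumed. */
    return kfi_provider_register(&example_prov);
}

static void __exit example_prov_exit(void)
{
    /* Tear down the kfi_* linkage at unload; reference counted
     * per the slide. Signature assumed. */
    kfi_provider_deregister(&example_prov);
}

module_init(example_prov_init);
module_exit(example_prov_exit);
MODULE_LICENSE("GPL");
```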

Data Storage: NVMe/F today
NVMe over Fabrics (NVMe/F) is an extension allowing access to an NVMe device over a fabric. It leverages the characteristics of verbs to good effect on verbs-based fabrics.
[Diagram: a kernel application drives VFS / block I/O / network FS / LNET over SCSI, iSCSI, NVMe, and NVMe/F, onto sockets and kverbs, reaching NICs, HCAs, and RNICs across IP, IB, and RoCE/iWARP]