RDMA Extensions for Persistency and Consistency

Presentation transcript:

RDMA Extensions for Persistency and Consistency
Idan Burstein & Rob Davis, Mellanox Technologies

Persistent Memory (PM) Benefits
Considerably faster than NAND flash
Performance accessible via PCIe or DDR interfaces
Lower cost per bit than DRAM
Significantly denser than DRAM

PM Needs a High Performance Fabric for Storage
                  PM        NAND
Read latency      ~100 ns   ~20 us
Write latency     ~500 ns   ~50 us

Applications Benefiting from PM
Compute disaggregation
Scale-out file/object systems
Software-defined storage
In-memory databases
Big Data & NoSQL workloads
Key-value stores
Machine learning
Hyperconverged infrastructure
Here are a couple of applications I think can take advantage of PM if we can network it with a very small performance penalty. The first is hyperconverged infrastructure, where the storage is a scale-out architecture and the applications run on the same servers as the storage. Application performance is best when storage is local, of course, but this is often not the case, and writes to storage need to be replicated remotely even when the storage is local.

Compute Disaggregation
(Diagram: applications on compute nodes access PM across the network.) Requires a high performance network.

How Do We Get There? Standards Work in Progress
Programming model: SNIA NVM Programming Model TWG; SNIA NVM PM Remote Access for High Availability
Other standards efforts: IBTA and others
Protocol: RDMA is the obvious choice
2015 FMS demos: PMC (7 us, 4 KB IO), HGST (2.3 us)

Core Requirements for Networked PM
(PM vs. NAND latency: read ~100 ns vs. ~20 us; write ~500 ns vs. ~50 us)
PM is really fast: it needs ultra-low-latency networking
PM has very high bandwidth: it needs ultra-efficient, transport-offload, high-bandwidth networks
Remote accesses must not add significant latency
PM networking requires predictability, fairness, and zero packet loss
InfiniBand/Ethernet switches and adapters must deliver all of these
First of all, why do we need to do something special to network PM? Can't we just use existing storage networking technology? The answer is yes, but with that approach we cannot take full advantage of PM performance; even compared to NAND-based SSDs the difference is striking. The industry is currently finalizing version 1 of a specialized solution for networking NAND devices, NVMe over Fabrics (NVMf), and we believe a specialized industry solution is also needed to network PM.

Peer-Direct NVRAM / RDMA Fabrics Proof of Concept
Platform: Mellanox RoCE, PMC NVRAM card
Peer-Direct I/O operations bypass the host CPU
7 us latency for a 4 KB IO, client-server to memory

HGST PCM Remote Access Demo
HGST live demo at Flash Memory Summit
PCM-based key-value store over 100 Gb/s RDMA

Application Level Performance
PCM is slightly slower than DRAM, but application-level performance is equivalent.

Performance with NVMf Community Driver Today
Two compute nodes: ConnectX-4 Lx, 25 Gb/s port
One storage node: ConnectX-4 Lx, 50 Gb/s port; 4x Intel NVMe devices (P3700/750 series)
Nodes connected through a switch
Added fabric latency ~12 us; we see similar added fabric latency with Intel SPDK

Remote PM Extensions Must Be Implemented in Many Places
Data needs to be pushed to the target, and a remote cache flush is needed
Affected protocols: NVMf, iSER, NFS over RDMA, SMB Direct

Software APIs for Persistent Memory
NVM Programming Model: describes application-visible behaviors (guarantees, error semantics), allows APIs to align with operating systems, exposes opportunities in networks and processors, and accelerates the availability of software that enables NVM (non-volatile memory) hardware.
Implementations: generic filesystem API, PMEM-aware filesystems, Linux DAX extensions, and the PMEM.IO userspace library, which bypasses the OS for persistency guarantees.
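As a rough illustration of the userspace path mentioned above, the PMEM.IO (PMDK) libpmem library lets an application map a file on a DAX filesystem and make stores durable without a kernel call. The file path below is hypothetical; this is a minimal sketch, not a complete program.

```c
/* Minimal sketch of userspace persistence with libpmem (PMDK).
 * Assumes a DAX-capable filesystem mounted at /mnt/pmem (hypothetical path).
 * Build with: cc persist.c -lpmem
 */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map (creating if needed) a 4 KiB file backed by persistent memory. */
    char *buf = pmem_map_file("/mnt/pmem/example", 4096,
                              PMEM_FILE_CREATE, 0666,
                              &mapped_len, &is_pmem);
    if (buf == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    strcpy(buf, "hello, persistent world");

    if (is_pmem)
        pmem_persist(buf, mapped_len);   /* CPU cache flush from userspace, no syscall */
    else
        pmem_msync(buf, mapped_len);     /* fall back to msync() through the OS */

    pmem_unmap(buf, mapped_len);
    return 0;
}
```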

NVM.PM.FILE Basic Semantics
MAP: associates a memory address with a file descriptor and offset; a specific address may be requested
SYNC: commits the read/write transactions to persistency; variants are optimized flush and optimized flush-and-verify
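For orientation, the MAP and (baseline) SYNC actions correspond roughly to the Linux mmap/msync primitives; the optimized-flush variant is the userspace cache-flush path shown in the libpmem sketch above. This sketch assumes a recent kernel/glibc where MAP_SYNC is available on a DAX filesystem, and the file path is illustrative only.

```c
/* Sketch of NVM.PM.FILE MAP/SYNC using plain Linux primitives.
 * MAP_SYNC (Linux 4.15+, DAX filesystem) makes the mapping PM-aware;
 * mmap() fails if the filesystem cannot honor it.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/example", O_RDWR);
    if (fd < 0)
        return 1;

    /* MAP: associate a virtual address range with the file. */
    size_t len = 4096;
    char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (addr == MAP_FAILED)
        return 1;

    memcpy(addr, "dirty data", 10);

    /* SYNC (baseline flush): commit the writes to persistency through the OS.
     * An "optimized flush" would flush CPU caches from userspace instead. */
    msync(addr, len, MS_SYNC);

    munmap(addr, len);
    close(fd);
    return 0;
}
```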

Remote Direct Memory Access (RDMA)

RDMA
Transport built on simple primitives, deployed in the industry for 15 years:
Queue Pair (QP): the RDMA communication endpoint; CONNECT establishes the connection between the two sides
Memory region registration (REG_MR): enables virtual network access to memory
SEND and RCV: reliable two-sided messaging
RDMA READ and RDMA WRITE: reliable one-sided memory-to-memory transfers
Reliability: delivery exactly once, in order
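A hedged sketch of two of the primitives named above with the libibverbs API: register a local buffer (REG_MR) and post a one-sided RDMA WRITE on an already-connected RC queue pair. QP/CQ creation and connection establishment are omitted, and remote_addr/rkey are assumed to have been exchanged with the peer out of band.

```c
/* Register a memory region and post a one-sided RDMA WRITE.
 * Assumes pd and qp were created and connected elsewhere.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    /* REG_MR: make the local buffer accessible to the HCA. */
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* advertised by the peer */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```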

RDMA WRITE Semantics
The RDMA acknowledgement (and completion) guarantees that the data has been successfully received and accepted for execution by the remote HCA
It does not guarantee that the data has reached remote host memory
Further guarantees are implemented by the ULP, which requires interrupts and adds latency to the transaction
Reliability extensions are required
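One commonly described ULP-level workaround (for example in the SNIA remote-access-for-HA discussions) is to chain a small RDMA READ after the WRITEs on the same QP: the READ's completion is used as an indication that the written data is visible in remote memory, though it still says nothing about persistence, which is what the FLUSH extension discussed later addresses. The sketch below assumes the QP was created without automatic signaling, so the WRITE stays unsignaled.

```c
/* Hedged sketch: RDMA WRITE followed by a small "barrier" RDMA READ,
 * chained in a single ibv_post_send() call.  The READ's completion is
 * used as a visibility indication; it does not guarantee persistence.
 */
#include <infiniband/verbs.h>
#include <stdint.h>

static int write_then_read_barrier(struct ibv_qp *qp,
                                   struct ibv_sge *write_sge,
                                   struct ibv_sge *readback_sge,
                                   uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr write_wr = {0}, read_wr = {0}, *bad_wr = NULL;

    write_wr.opcode              = IBV_WR_RDMA_WRITE;
    write_wr.sg_list             = write_sge;
    write_wr.num_sge             = 1;
    write_wr.wr.rdma.remote_addr = remote_addr;
    write_wr.wr.rdma.rkey        = rkey;
    write_wr.next                = &read_wr;      /* chain the barrier READ */

    read_wr.opcode              = IBV_WR_RDMA_READ;
    read_wr.sg_list             = readback_sge;   /* small scratch buffer */
    read_wr.num_sge             = 1;
    read_wr.send_flags          = IBV_SEND_SIGNALED;
    read_wr.wr.rdma.remote_addr = remote_addr;
    read_wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &write_wr, &bad_wr);
}
```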

RDMA Atomics

SEND/RCV

RDMA READ

Ordering Rules

RDMA Storage Protocols

Variety of RDMA Storage Protocols
SAN: many options
File level: SMB Direct, NFS over RDMA
Block level: SCSI-based (iSCSI/iSER, SRP over InfiniBand) and NVMe with NVMe-oF
All follow a similar exchange model

Example: NVMe over Fabrics

NVMe-oF IO WRITE (PULL)
Host: SEND carrying the Command Capsule (CC); upon RCV completion of the response, free the data buffer (invalidate if needed)
Subsystem: upon RCV completion, allocate memory for the data and post an RDMA READ to fetch it; upon READ completion, post the command to the backing store; upon SSD completion, send the NVMe-oF response capsule (RC) and free the memory; upon SEND completion, free the CC and completion resources
Note: this flow adds a round trip to fetch the data
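To make the chain of completions concrete, here is an illustrative, heavily simplified skeleton of the subsystem side of this pull flow. Every helper below (alloc_request_from_capsule, post_rdma_read_for, and so on) is a hypothetical placeholder, not a real NVMe-oF target API; only the ordering of the steps reflects the slide.

```c
/* Subsystem-side event handlers for the NVMe-oF WRITE (pull) flow.
 * All helpers are hypothetical placeholders marking where each step happens.
 */
struct request;
struct request *alloc_request_from_capsule(const void *capsule);
void alloc_data_buffer(struct request *r);
void post_rdma_read_for(struct request *r);
void submit_to_backing_store(struct request *r);
void send_nvmf_completion(struct request *r);
void release_request(struct request *r);

void on_recv_completion(const void *capsule)
{
    struct request *r = alloc_request_from_capsule(capsule);
    alloc_data_buffer(r);        /* staging memory for the incoming data */
    post_rdma_read_for(r);       /* extra round trip: fetch data from host */
}

void on_rdma_read_completion(struct request *r)
{
    submit_to_backing_store(r);  /* issue the NVMe write to the SSD */
}

void on_ssd_completion(struct request *r)
{
    send_nvmf_completion(r);     /* SEND the response capsule, free data */
}

void on_send_completion(struct request *r)
{
    release_request(r);          /* free CC and completion resources */
}
```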

NVMe-oF IO READ
Host: post SEND carrying the Command Capsule (CC)
Subsystem: upon RCV completion, allocate memory for the data and post the command to the backing store; upon SSD completion, post an RDMA WRITE to return the data to the host and send the NVMe response capsule (RC); upon SEND completion, free the memory, the CC, and the completion resources
Note: interrupts are required on the read path

NVMe-oF IO WRITE In-Capsule (PUSH)
Host: post SEND carrying the Command Capsule (CC) with the data in-capsule; upon RCV completion of the response, free the data buffer (invalidate if needed)
Subsystem: upon RCV completion, allocate memory for the data; upon SSD completion, send the NVMe-oF response capsule (RC) and free the memory; upon SEND completion, free the CC and completion resources
Note: large receive buffers must be posted in advance, and the data is placed in a non-deterministic location with respect to the requester

Summary: Storage over RDMA Model Overview
Commands and status are exchanged with SEND; data movement uses RDMA initiated by the storage target
Protocol design considerations: the latency of the storage device is orders of magnitude higher than the network latency; storage is usually not RDMA addressable; scalability of the intermediate memory

Latency Breakdown for Accessing Storage
Traditional storage protocols add ~10 usec of latency: round-trip time (RTT), interrupt overheads, memory management and processing, and operating system calls
RDMA latency itself may be as low as 0.7 usec

Latency Breakdown for Accessing Persistent Memory
Non-volatile memory technologies are reducing storage latency to <1 usec
Spending time on RTT and interrupts becomes unacceptable and unnecessary
The memory mapped for DMA is not a staging buffer; it is the storage media itself
PMEM storage should be exposed directly to RDMA access to meet its performance goals

Remote Access to Persistent Memory Exchange Model

Proposal: RDMA READ Directly From PMEM
Utilize RDMA's capabilities to communicate directly with the storage media (DMA model vs. RPC model)

Proposal: RDMA WRITE Directly to PMEM
Utilize RDMA's capabilities to communicate directly with the storage media (RPC model vs. DMA model)
Resolves the latency, memory scalability, and processing overhead of communicating with PMEM
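The key step of the DMA model is that the persistent-memory mapping itself is registered for remote access, so a peer's RDMA WRITE lands in the storage media rather than in a staging buffer. A hedged sketch combining libpmem and libibverbs is below; the path is illustrative and error handling is minimal.

```c
/* Map a persistent-memory file and register the mapping directly as an
 * RDMA memory region, so peers can RDMA READ/WRITE the storage media.
 */
#include <infiniband/verbs.h>
#include <libpmem.h>

struct ibv_mr *register_pmem_region(struct ibv_pd *pd, const char *path,
                                    size_t len, void **addr_out)
{
    size_t mapped_len;
    int is_pmem;

    void *addr = pmem_map_file(path, len, PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (!addr)
        return NULL;

    /* Expose the mapped persistent memory for remote READ/WRITE. */
    struct ibv_mr *mr = ibv_reg_mr(pd, addr, mapped_len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        pmem_unmap(addr, mapped_len);
        return NULL;
    }

    *addr_out = addr;
    return mr;   /* mr->rkey and addr are advertised to the initiator */
}
```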

The DMA Model Is Mostly Relevant When...
Storage access time becomes comparable to the network round-trip time
The storage device can be memory-mapped directly as the RDMA WRITE target buffer
Main challenge: the reliability guarantees of RDMA WRITE/Atomics

RDMA Extensions for Persistency and Consistency

RDMA Extensions for Memory Persistency and Consistency
RDMA FLUSH extension: a new RDMA operation
Higher performance, and standardized
A straightforward, evolutionary fit to the RDMA transport
Target guarantees: consistency and persistency

RDMA FLUSH Operation: System Implications
System-level implications include caching efficiency, persistent memory bandwidth and durability, and the performance cost of the flush operation itself
The new reliability semantics should take these implications into account during protocol design; they are the basis for our requirements

Baseline Requirements
Enable flushing data to persistency selectively
FLUSH the data only once, consistently
The responder should be able to provision which memory regions may execute FLUSH and are targeted at PMEM
Enable amortization of the FLUSH operation over multiple WRITEs (see the sketch below)
Utilize RDMA ordering rules for ordering guarantees
(Figure: FLUSH is a new transport operation)
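To illustrate the amortization requirement, the sketch below posts several unsignaled RDMA WRITEs and then a single flush covering the whole written range. FLUSH is the proposed extension, not a standard verbs opcode here, so post_flush() is a hypothetical placeholder standing in for whatever API the IBTA extension defines; the WRITE posting uses the regular verbs calls and relies on RDMA ordering rules to keep the writes ahead of the flush.

```c
/* Conceptual sketch: amortize one FLUSH over several RDMA WRITEs.
 * post_flush() is a hypothetical placeholder for the proposed operation.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_flush(struct ibv_qp *qp, uint64_t remote_addr, size_t len,
               uint32_t rkey);   /* hypothetical, not part of libibverbs */

int replicate_dirty_pages(struct ibv_qp *qp, struct ibv_sge *pages,
                          int npages, uint64_t remote_base, uint32_t rkey,
                          size_t page_size)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    /* Post the WRITEs unsignaled; ordering rules keep them before FLUSH. */
    for (int i = 0; i < npages; i++) {
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &pages[i];
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_base + (uint64_t)i * page_size;
        wr.wr.rdma.rkey        = rkey;
        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;
    }

    /* One FLUSH covers the whole written range, amortizing its cost. */
    return post_flush(qp, remote_base, (size_t)npages * page_size, rkey);
}
```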

Use Case: RDMA to PMEM for High Availability

Use Case: RDMA to PMEM for High Availability
MAP: map a memory address for the file; register the memory address for replication
SYNC: write all the "dirty" pages to the remote replica, then FLUSH the writes to persistency
UNMAP: invalidate the pages registered for replication

Conclusions
Persistent memory performance and functionality bring a paradigm shift to storage technology
Effective remote access to such media will require revising the current RDMA storage protocols
The IBTA is extending the transport capabilities of RDMA to support placement guarantees and align them into a single efficient standard
Join the IBTA if you want to contribute, and help us drive the standards in other groups (PCIe, JEDEC, NVMe, etc.)

Thanks! idanb@mellanox.com robd@Mellanox.com