RDMA with PMEM: Software mechanisms for enabling access to remote persistent memory
Chet Douglas, Intel Corporation
2015 Storage Developer Conference. © Intel Corporation. All Rights Reserved.

RDMA with PMEM – Agenda
- Problem Statement
- Platform HW Background
- Visibility vs Durability
- Local Node Data Durability Considerations
- Remote Node Data Durability Considerations
- “Appliance Method” for Remote Data Durability
- “General Purpose Server Method” for Remote Data Durability
- Modeling SW Latencies for Remote Data Durability
- Performance Tradeoffs & Durability Mechanism Pros/Cons
- Future Considerations
- Takeaways

RDMA with PMEM – Problem Statement
- Current RDMA HW and SW solutions do not take persistent memory (pmem) into account and have no explicit mechanism for forcing RDMA Writes to remote pmem to be durable (power-fail atomic)
- HA and HPC pmem use cases require writes to remote memory between compute nodes utilizing RDMA, e.g. remote database log updates and remote data replication
- In the short term, SW-only modifications can make RDMA SW applications persistent-memory aware and force written data to be persistent (this slide deck)
- In the longer term, industry-wide enabling through standards organizations is required to provide vendor-agnostic mechanisms for utilizing persistent memory with RDMA, requiring HW and SW changes (see Tom Talpey’s session)

RDMA with PMEM – HW Background
- ADR – Asynchronous DRAM Refresh
  - Allows NVDIMMs to save DRAM contents to flash on power loss
  - ADR Domain – all data inside the domain is protected by ADR and will reach pmem before supercap power dies. The integrated memory controller (iMC) is currently inside the ADR Domain.
- IO Controller
  - Controls IO flow between PCIe devices and main memory
  - “Allocating write transactions”
    - The PCI Root Port utilizes write buffers backed by cache
    - Data buffers are naturally aged out of cache to main memory
  - “Non-allocating write transactions”
    - PCI Root Port write transactions utilize buffers not backed by cache
    - Forces write data to move to the iMC without cache delay
    - Enabled/disabled via a BIOS setting
- DDIO – Data Direct IO
  - Allows bus-mastering PCI and RDMA IO to move data directly in/out of the LLC core caches
  - Allocating write transactions utilize DDIO

RDMA with PMEM – Visibility vs Durability
- Write Data Visibility Point – once data makes it out of the IO controller, HW makes the data visible to all caches and cores, independent of DRAM or NVDIMM devices
- Write Data Durability Point – data is typically not considered durable until it has been successfully written by the Memory Controller to persistent media

RDMA with PMEM – Local Node Data Durability Considerations
- First, write data must be flushed from the CPU caches to the Memory Controller (and optionally invalidated)
- Second, write data in the Memory Controller must be flushed to pmem
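A minimal sketch (not from the deck) of these two steps using x86 intrinsics, assuming a compiler with CLFLUSHOPT support; the function name and parameters are illustrative, and PCOMMIT is emitted as raw opcode bytes since compiler support for it varied:

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define CACHE_LINE 64

/* Flush [addr, addr+len) from the CPU caches to the iMC, then commit
 * the iMC's pending write data to persistent media. */
static void flush_to_pmem(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    /* Step 1: CLFLUSHOPT each cache line, then SFENCE to wait for the
     * flushes to complete. */
    for (; p < end; p += CACHE_LINE)
        _mm_clflushopt((void *)p);
    _mm_sfence();

    /* Step 2: PCOMMIT (emitted by opcode, illustrative) to force the
     * Memory Controller to drain write data to the NVDIMM media. */
    asm volatile(".byte 0x66, 0x0f, 0xae, 0xf8" ::: "memory");
    _mm_sfence();
}
```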

RDMA with PMEM – Remote Node Data Durability Considerations
- The same steps used to make write data durable on a local node must be followed on the remote Target node
- The signal to make data durable does not need to come from the Initiator node

RDMA with PMEM – “Appliance” Durability Mechanism
- Use non-allocating writes for the remote Target node RNIC to bypass CPU caches – the Target node does not need to issue CLFLUSHOPTs
- Follow a number of RDMA Writes with an RDMA Read at the point SW wishes to make data durable – this flushes the PCI and Memory Controller pipelines
- Utilize ADR so the Target node does not need to issue PCOMMIT
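A sketch of what the Initiator-side flow might look like with libibverbs (an API the deck later mentions as a candidate abstraction). Queue pair setup, memory registration, and most error handling are omitted, and all names and parameters are assumptions; the point is the ordering: unsignaled RDMA Writes followed by one small signaled RDMA Read whose completion serves as the durability point:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

static int write_then_flush(struct ibv_qp *qp, struct ibv_cq *cq,
                            struct ibv_sge *write_sges, uint64_t *remote_addrs,
                            int nwrites, uint32_t rkey,
                            struct ibv_sge *read_sge, uint64_t read_raddr)
{
    struct ibv_send_wr wr = {0}, *bad;
    struct ibv_wc wc;
    int n;

    /* Post the bulk RDMA Writes without completion signaling. */
    for (int i = 0; i < nwrites; i++) {
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &write_sges[i];
        wr.num_sge             = 1;
        wr.send_flags          = 0;
        wr.wr.rdma.remote_addr = remote_addrs[i];
        wr.wr.rdma.rkey        = rkey;
        if (ibv_post_send(qp, &wr, &bad))
            return -1;
    }

    /* One small signaled RDMA Read acts as the durability point. */
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = read_sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = read_raddr;
    wr.wr.rdma.rkey        = rkey;
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* When the Read completes, the preceding Writes are durable
     * (non-allocating writes + ADR on the Target, per this method). */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    return (n > 0 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```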

RDMA with PMEM – “General Purpose Server” Durability Mechanism
- Use the default platform configuration and allocating write flows
- Initiator SW issues an RDMA Send at the SW durability point and passes a list of cache lines to make durable
- The Target node is interrupted and its SW interrupt handler issues CLFLUSHOPT for each cache line in the list to push the data into the Memory Controller, followed by PCOMMIT to force the write data to pmem media
- The Target SW interrupt handler issues an RDMA Send back to the Initiator SW to complete the synchronous handshake
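A sketch of what the Target-side handler might look like, again assuming CLFLUSHOPT intrinsics and libibverbs; the durability_region list layout, function names, and reply plumbing are illustrative, not from the deck:

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>
#include <infiniband/verbs.h>

#define CACHE_LINE 64

struct durability_region {          /* one entry in the received list */
    uint64_t addr;
    uint64_t len;
};

static int handle_durability_request(struct ibv_qp *qp,
                                     const struct durability_region *list,
                                     size_t count,
                                     struct ibv_sge *reply_sge)
{
    struct ibv_send_wr wr = {0}, *bad;

    /* Flush every cache line named by the Initiator into the iMC. */
    for (size_t i = 0; i < count; i++) {
        uintptr_t p   = list[i].addr & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end = list[i].addr + list[i].len;
        for (; p < end; p += CACHE_LINE)
            _mm_clflushopt((void *)p);
    }
    _mm_sfence();

    /* Commit the flushed data from the iMC to persistent media
     * (PCOMMIT emitted by opcode, illustrative). */
    asm volatile(".byte 0x66, 0x0f, 0xae, 0xf8" ::: "memory");
    _mm_sfence();

    /* Complete the handshake: RDMA Send back to the Initiator. */
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = reply_sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &wr, &bad);
}
```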

RDMA with PMEM – Modeling SW Latencies for Remote Durability Mechanisms
- CLFLUSHOPT/SFENCE model
  - Basic SW latency model for determining the order of magnitude of flushing write data from the CPU caches to the Memory Controller and waiting for the flushes to complete
  - Does not attempt to be cycle accurate for the non-blocking CLFLUSHOPT instruction, which is not that interesting at this level
  - For a single cache line flush, use CLFLUSH/SFENCE as a proxy
  - For multiple cache lines, use full-line non-temporal (NT) stores as a proxy
    - NT stores flush each cache line but do not have the serializing behavior of CLFLUSH
    - _mm_stream_si128() streaming execution time can vary depending on CPU preparation and cache preloading
    - Measure the time to move the entire data transfer amount so that variance in the first cache line move is averaged out
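A sketch of the NT-store proxy measurement, assuming 16-byte-aligned buffers and a POSIX monotonic clock; the function and parameter names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <emmintrin.h>   /* _mm_stream_si128, _mm_sfence */

/* Time a full-buffer copy done with non-temporal stores, as a proxy for
 * flushing that many cache lines to the Memory Controller. */
static double nt_copy_seconds(void *dst, const void *src, size_t bytes)
{
    struct timespec t0, t1;
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < bytes / sizeof(__m128i); i++)
        _mm_stream_si128(&d[i], s[i]);   /* non-temporal full-line stores */
    _mm_sfence();                        /* wait for the NT stores to drain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```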

RDMA with PMEM – Modeling SW Latencies for Remote Durability Mechanisms
- PCOMMIT/SFENCE model
  - Basic SW latency model for determining the order of magnitude of committing write data from the iMC to pmem
  - Does not attempt to be cycle accurate for the non-blocking PCOMMIT instruction, which is not that interesting at this level
  - The latency calculation is bounded by the fraction of cache lines that have not yet been evicted by the time SW commits the write data
    - Determine the amount of write data to be committed based on FractionOfNotEvCacheLines
    - Use the backend NVDIMM bandwidth to calculate a basic latency for the write data not already evicted
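A back-of-the-envelope version of this bound; all numbers below are illustrative placeholders, not measurements from the deck:

```c
#include <stdio.h>

int main(void)
{
    double transfer_bytes            = 64 * 1024;   /* e.g. a 64 KiB RDMA transfer */
    double fraction_not_evicted      = 0.5;         /* FractionOfNotEvCacheLines, assumed */
    double nvdimm_write_bw_bytes_sec = 2e9;         /* assumed backend NVDIMM write bandwidth */

    /* Latency bound: data still sitting in caches/iMC divided by the
     * bandwidth at which the NVDIMM can absorb it. */
    double commit_latency_sec =
        (transfer_bytes * fraction_not_evicted) / nvdimm_write_bw_bytes_sec;

    printf("estimated PCOMMIT drain latency: %.2f us\n", commit_latency_sec * 1e6);
    return 0;
}
```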

RDMA with PMEM – Performance & Durability Mechanism Pros/Cons

RDMA with PMEM – Future Considerations
- SNIA NVM Programming Model
  - Extend to address remote access to pmem
- SNIA NVM PM Remote Access for High-Availability white paper
  - Already contains basic NVM.PM.FILE.OPTIMIZED_FLUSH concepts
  - Add an appendix or extension covering:
    - The durability mechanisms described here
    - Platform and HW capabilities to detect supported mechanisms
(Slide figure: proposed NVM Programming Model updates for NVM.PM.FILE.OPTIMIZED_FLUSH from the NVM PM Remote Access for High-Availability white paper)

RDMA with PMEM – Key Takeaways
- Inserting SW durability points into existing RDMA-based SW is not difficult
- Prepare for future fabric protocol, HW (RNIC, CPU, chipset), and SW extensions that will make remote persistent memory programming even easier and improve latency
- Consider abstracting durability mechanisms behind a SW library such as Intel’s NVML, or the OFA libfabrics and libibverbs APIs
- Get involved in the SNIA NVM Programming TWG! Bigger changes are coming with future extensions to RDMA, PMEM, and data durability