On Demand Paging (ODP) Update Liran Liss Mellanox Technologies.

Slides:



Advertisements
Similar presentations
Paper by: Yu Li, Jianliang Xu, Byron Choi, and Haibo Hu Department of Computer Science Hong Kong Baptist University Slides and Presentation By: Justin.
Advertisements

Tutorial 8 March 9, 2012 TA: Europa Shang
CSCC69: Operating Systems
The Linux Kernel: Memory Management
Operating Systems ECE344 Ding Yuan Final Review Lecture 13: Final Review.
EECS 470 Virtual Memory Lecture 15. Why Use Virtual Memory? Decouples size of physical memory from programmer visible virtual memory Provides a convenient.
OPERATING SYSTEMS 8 – VIRTUAL MEMORY PIETER HARTEL 1.
1 Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory.
Virtual Memory Chapter 8. Hardware and Control Structures Memory references are dynamically translated into physical addresses at run time –A process.
Threads 1 CS502 Spring 2006 Threads CS-502 Spring 2006.
DAPL: Direct Access Transport Libraries Introduction and Example Yufei 10/01/2010.
Translation Buffers (TLB’s)
The SNIA NVM Programming Model
CS364 CH08 Operating System Support TECH Computer Science Operating System Overview Scheduling Memory Management Pentium II and PowerPC Memory Management.
Layers and Views of a Computer System Operating System Services Program creation Program execution Access to I/O devices Controlled access to files System.
Memory-Mapped Files & Unified VM System Vivek Pai.
Signature Verbs Extension Richard L. Graham. Data Integrity Field (DIF) Used to provide data block integrity check capabilities (CRC) for block storage.
Operating Systems Chapter 8
Chapter 5 Operating System Support. Outline Operating system - Objective and function - types of OS Scheduling - Long term scheduling - Medium term scheduling.
CS 241 Fall 2007 System Programming 1 Virtual Memory Lawrence Angrave and Vikram Adve.
1 File Systems: Consistency Issues. 2 File Systems: Consistency Issues File systems maintains many data structures  Free list/bit vector  Directories.
RDMA Bonding Liran Liss Mellanox Technologies. Agenda Introduction Transport-level bonding RDMA bonding design Recovering from failure Implementation.
CUDA Streams These notes will introduce the use of multiple CUDA streams to overlap memory transfers with kernel computations. Also introduced is paged-locked.
Virtual Memory 1 Chapter 13. Virtual Memory Introduction Demand Paging Hardware Requirements 4.3 BSD Virtual Memory 4.3 BSD Memory Management Operations.
Virtual Memory. DRAM as cache What about programs larger than DRAM? When we run multiple programs, all must fit in DRAM! Add another larger, slower level.
Operating Systems Lecture November 2015© Copyright Virtual University of Pakistan 2 Agenda for Today Review of previous lecture Hardware (I/O, memory,
Disco : Running commodity operating system on scalable multiprocessor Edouard et al. Presented by Vidhya Sivasankaran.
Lecture 11 Page 1 CS 111 Online Working Sets Give each running process an allocation of page frames matched to its needs How do we know what its needs.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Virtual Memory Hardware.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Demand Paging.
iSER update 2014 OFA Developer Workshop Eyal Salomon
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Virtual Memory Implementation.
Linux Kernel Development Memory Management Pavel Sorokin Gyeongsang National University
An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems Isaac Gelado, Javier Cabezas. John Stone, Sanjay Patel, Nacho Navarro.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
Memory Management & Virtual Memory. Hierarchy Cache Memory : Provide invisible speedup to main memory.
Translation Lookaside Buffer
Lecture 11 Virtual Memory
Virtual Memory Chapter 7.4.
Virtual Memory User memory model so far:
Virtual Memory: Systems
Andrew Hanushevsky: Memory Mapped I/O
Linux device driver development
CSE 153 Design of Operating Systems Winter 2018
Introduction to Operating Systems
OS Virtualization.
CSCI206 - Computer Organization & Programming
O.S Lecture 13 Virtual Memory.
Virtual Memory: Systems
Chapter 9: Virtual-Memory Management
Virtual Memory: Systems
Making Virtual Memory Real: The Linux-x86-64 way
Page Replacement.
Virtual Memory Hardware
Morgan Kaufmann Publishers Memory Hierarchy: Virtual Memory
Translation Buffers (TLB’s)
Virtual Memory Overcoming main memory size limitation
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
Contents Memory types & memory hierarchy Virtual memory (VM)
Translation Buffers (TLB’s)
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
Remote Page Faults Over RDMA
CSE 153 Design of Operating Systems Winter 2019
Translation Buffers (TLBs)
Buddy Allocation CS 161: Lecture 5 2/11/19.
COMP755 Advanced Operating Systems
Review What are the advantages/disadvantages of pages versus segments?
CSE 542: Operating Systems
Memory Management & Virtual Memory
Threads CSE 2431: Introduction to Operating Systems
Presentation transcript:

On Demand Paging (ODP) Update Liran Liss Mellanox Technologies

Agenda Introduction Implementation notes APIs and usage Statistics What’s new March 30 – April 2, 2014#OFADevWorkshop2

Memory Registration Challenges Registered memory size limited to physical memory Requires special memory locking privileges Registration is a costly operation Requires careful application design for high performance –Bounce buffers –Pin-down caches Keeping address space and registered memory in synch is hard and error prone March 30 – April 2, 2014#OFADevWorkshop3

On Demand Paging MR pages are never pinned by the OS –Paged in when HCA needs them –Paged out when reclaimed by the OS HCA translation tables may contain non-present pages –Initially, a new MR is created with non-present pages –Virtual memory mappings don’t necessarily exist Advantages –Greatly simplified programming Reduce/eliminate registrations, no copying, no caches –Unlimited MR sizes No need for special privileges –Physical memory optimized to hold current working set For both CPU and IO access March 30 – April 2, 2014#OFADevWorkshop4

ODP Operation March 30 – April 2, 2014#OFADevWorkshop5 0x1000 PFN1 0x2000 PFN2 0x3000 PFN3 0x4000 PFN4 0x5000 PFN5 0x6000 PFN6 0x1000 0x2000 0x3000 0x4000 0x5000 0x6000 Address SpaceIO Virtual Address HCA data access HCA data access OS PTE change Page fault! ODP promise: IO virtual address mapping == Process virtual address mapping ODP promise: IO virtual address mapping == Process virtual address mapping

Implementation March 30 – April 2, 2014#OFADevWorkshop6 ib_core mlx5_core/ mlx5_ib mlx5_core/ mlx5_ib Key->MR tree MR interval tree HCA MR umem mlx5_ib_mr Page/DMA list HW translation tables Kernel QP page fault event mmu_notifer page invalidation

ODP capabilities March 30 – April 2, 2014#OFADevWorkshop7 enum odp_transport_cap_bits { ODP_SUPPORT_SEND = 1 << 0, ODP_SUPPORT_RECV = 1 << 1, ODP_SUPPORT_WRITE = 1 << 2, ODP_SUPPORT_READ = 1 << 3, ODP_SUPPORT_ATOMIC = 1 << 4, }; enum odp_general_caps { ODP_SUPPORT = 1 << 0, }; struct ibv_odp_caps { uint32_t comp_mask; uint32_t general_caps; struct { uint32_t rc_odp_caps; uint32_t uc_odp_caps; uint32_t ud_odp_caps; uint32_t xrc_odp_caps; } per_transport_caps; }; int ibv_query_odp_caps(struct ibv_context *context, struct ibv_odp_caps *caps, size_t caps_size); enum odp_transport_cap_bits { ODP_SUPPORT_SEND = 1 << 0, ODP_SUPPORT_RECV = 1 << 1, ODP_SUPPORT_WRITE = 1 << 2, ODP_SUPPORT_READ = 1 << 3, ODP_SUPPORT_ATOMIC = 1 << 4, }; enum odp_general_caps { ODP_SUPPORT = 1 << 0, }; struct ibv_odp_caps { uint32_t comp_mask; uint32_t general_caps; struct { uint32_t rc_odp_caps; uint32_t uc_odp_caps; uint32_t ud_odp_caps; uint32_t xrc_odp_caps; } per_transport_caps; }; int ibv_query_odp_caps(struct ibv_context *context, struct ibv_odp_caps *caps, size_t caps_size);

ODP Memory Regions Registering the whole address space –ibv_reg_mr(pd, NULL, (u64) -1, flags) –Memory windows may be used to provide granular remote access rights March 30 – April 2, 2014#OFADevWorkshop8 enum ibv_access_flags { IBV_ACCESS_LOCAL_WRITE= 1, IBV_ACCESS_REMOTE_WRITE= (1<<1), IBV_ACCESS_REMOTE_READ= (1<<2), IBV_ACCESS_REMOTE_ATOMIC= (1<<3), IBV_ACCESS_MW_BIND= (1<<4), IBV_ACCESS_ON_DEMAND= (1<<5) }; struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, int access); enum ibv_access_flags { IBV_ACCESS_LOCAL_WRITE= 1, IBV_ACCESS_REMOTE_WRITE= (1<<1), IBV_ACCESS_REMOTE_READ= (1<<2), IBV_ACCESS_REMOTE_ATOMIC= (1<<3), IBV_ACCESS_MW_BIND= (1<<4), IBV_ACCESS_ON_DEMAND= (1<<5) }; struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, int access);

Usage Example March 30 – April 2, 2014#OFADevWorkshop9 int main() { struct ibv_odp_caps caps; ibv_mr *mr; struct ibv_sge sge; struct ibv_send_wr wr;... if (ibv_query_odp_caps(ctx, &caps, sizeof(caps)) || !(caps.ud_odp_caps & ODP_SUPPORT_SEND)) return -1;... p = mmap(NULL, 10 * MB, PROT_READ | PROT_WRITE, MAP_SHARED, 0, 0);... mr = ibv_reg_mr(ctx->pd, p, 10 * MB, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_ON_DEMAND);... sge.addr = p; sge.lkey = mr->lkey; ibv_post_send(ctx->qp, &wr, &bad_wr);... munmap(p, 1 * MB); p = mmap(p, 1 * MB, PROT_READ | PROT_WRITE, MAP_SHARED, 0, 0);... ibv_post_send(ctx->qp, &wr, &bad_wr);... return 0; } int main() { struct ibv_odp_caps caps; ibv_mr *mr; struct ibv_sge sge; struct ibv_send_wr wr;... if (ibv_query_odp_caps(ctx, &caps, sizeof(caps)) || !(caps.ud_odp_caps & ODP_SUPPORT_SEND)) return -1;... p = mmap(NULL, 10 * MB, PROT_READ | PROT_WRITE, MAP_SHARED, 0, 0);... mr = ibv_reg_mr(ctx->pd, p, 10 * MB, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_ON_DEMAND);... sge.addr = p; sge.lkey = mr->lkey; ibv_post_send(ctx->qp, &wr, &bad_wr);... munmap(p, 1 * MB); p = mmap(p, 1 * MB, PROT_READ | PROT_WRITE, MAP_SHARED, 0, 0);... ibv_post_send(ctx->qp, &wr, &bad_wr);... return 0; }

Memory Prefetching Best effort hint –Not necessarily all pages are pre-fetched –No guarantees that pages remain resident –Asynchronous Can be invoked opportunistically in parallel to IO Use cases –Avoid multiple page faults by small transactions –Pre-fault a large region about to be accessed by IO EFAULT returned when –Range exceeds the MR –Requested pages not part of address space March 30 – April 2, 2014#OFADevWorkshop10 struct ibv_prefetch_attr { uint32_t comp_mask; int flags; /* IBV_ACCESS_LOCAL_WRITE */ void *addr; size_t length; }; int ibv_prefetch_mr(struct ibv_mr *mr, struct ibv_prefetch_attr *attr, size_t attr_size); struct ibv_prefetch_attr { uint32_t comp_mask; int flags; /* IBV_ACCESS_LOCAL_WRITE */ void *addr; size_t length; }; int ibv_prefetch_mr(struct ibv_mr *mr, struct ibv_prefetch_attr *attr, size_t attr_size);

Statistics Core statistics –Maintained by the IB core layer –Tracked on a per device basis –Reported by sysfs Use cases –Page fault pattern Warm-up Steady state –Paging efficiency –Detect thrashing –Measure pre-fetch impact March 30 – April 2, 2014#OFADevWorkshop11 /sys/class/infiniband_verbs/uverbs / invalidations_faults_contentions num_invalidation_pages num_invalidations num_page_fault_pages num_page_faults num_prefetches_handled Counter nameDescription invalidations_faults _contentions Number of times that page fault events were dropped or prefetch operations were restarted due to OS page invalidations num_invalidation_ pages Total number of pages invalidated during all invalidation events num_invalidationsNumber of invalidation events num_page_fault_p ages Total number of pages faulted in by page fault events num_page_faultsNumber of page fault events num_prefetches_h andled Number of prefetch Verb calls that completed successfully

Statistics (continued) Driver debug statistics –Maintained by the mlx5 driver –Tracked on a per device basis –Reported by debugfs Use cases –Track accesses to non- mapped memory –ODP MR usage March 30 – April 2, 2014#OFADevWorkshop12 /sys/kernel/debug/mlx5/ /odp_stats/ num_failed_resolutions num_mrs_not_found num_odp_mr_pages num_odp_mrs Counter nameDescription num_failed_resolutions Number of failed page faults that could not be resolved due to non-existing mappings in the OS num_mrs_not_found Number of faults that specified a non-existing ODP MR num_odp_mr_pages Total size in pages of current ODP MRs num_odp_mrsNumber of current ODP MRs

News Connect-IB Support –Initially UD and RC DC will follow –Address space key for local access Initial testing with OpenMPI –No more memory hooks! Release planned for MLNX_OFED-2.3 ODP patches submitted to kernel March 30 – April 2, 2014#OFADevWorkshop13

#OFADevWorkshop Thank You