Agenda
- About us
- Why para-virtualize RDMA
- Project overview
- Open issues
- Future plans

About us
- Marcel, from the KVM team at Red Hat
- Yuval, from the Networking/RDMA team at Oracle
- This is a shared-effort open source project to implement a paravirtual RDMA device.
- We implement VMware's pvrdma device in QEMU, since that saves us time (the guest drivers and library already exist).

Why para-virtualize RDMA
- Make guests device-agnostic while still leveraging HW capabilities
- SR-IOV becomes optional (by using multiple GIDs)
- Migration support
- Memory overcommit (without HW support)
- Fast build-up of test clusters

Overview
[Architecture diagram: a PVRDMA driver in the guest talks to the QEMU PVRDMA device; in the host kernel the backend is either the soft-RoCE (rxe) IB device on top of a netdev/TAP path, or a physical RoCE device exposed as PF/VF (Func0: ETH, Func1: RoCE).]

Expected performance
Comparing:
- Ethernet virtio NICs (the state of the art in para-virtualization)
- PVRDMA RoCE with a Soft-RoCE backend
- PVRDMA RoCE with a physical RDMA VF
Looking at throughput for small and large packets.

PCI registers
- BAR 0: MSI-X   Vec0: RING EVT | Vec1: Async EVT | Vec3: Cmd EVT
- BAR 1: Regs    VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC
- BAR 2: UAR     QP_NUM | SEND|RECV || CQ_NUM | ARM|POLL
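As a rough map of the layout above, here is a C sketch; the symbolic names and numeric offsets are illustrative assumptions for exposition, not taken from the actual driver headers.

```c
#include <stdint.h>

/* Illustrative sketch of the pvrdma PCI layout described above.
 * Names and offsets are assumptions, not the real driver definitions. */

/* BAR 0: MSI-X vectors */
enum pvrdma_msix_vec {
    VEC_RING_EVT,    /* completion-ring events  */
    VEC_ASYNC_EVT,   /* asynchronous events     */
    VEC_CMD_EVT,     /* command-channel events  */
};

/* BAR 1: 32-bit device registers */
enum pvrdma_reg {
    REG_VERSION = 0x00,  /* R:  device version                       */
    REG_DSR     = 0x04,  /* W:  Device Shared Region address         */
    REG_CTL     = 0x0c,  /* W:  activate / quiesce / reset           */
    REG_REQ     = 0x10,  /* W:  execute the command in the CMD slot  */
    REG_ERR     = 0x14,  /* R:  command status / error code          */
    REG_ICR     = 0x18,  /* R:  interrupt cause                      */
    REG_IMR     = 0x1c,  /* RW: interrupt mask                       */
    REG_MAC     = 0x20,  /* RW: MAC address                          */
};

/* BAR 2: UAR (doorbell page) */
enum pvrdma_uar_off {
    UAR_QP_OFFSET = 0,   /* write: qp_num | SEND/RECV op bits */
    UAR_CQ_OFFSET = 4,   /* write: cq_num | ARM/POLL op bits  */
};
```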

PCI registers – BAR 1
BAR 1: Regs   VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC
- VERSION
- DSR: Device Shared Region (command channel)
  - CMD slot address
  - RESPONSE slot address
  - Device CQ ring
- CTL: Activate device / Quiesce / Reset
- REQ: Execute the command at the CMD slot; the result is stored in the RESPONSE slot
- ERR: Error details
- IMR: Interrupt mask
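A minimal sketch of how the guest driver could drive this command channel, assuming the register offsets from the previous sketch and a DSR that carries the CMD/RESPONSE slot addresses; the structure and field names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical shape of the Device Shared Region and of one command
 * round-trip; not the actual driver structures. */
struct pvrdma_dsr {
    uint64_t cmd_slot_dma;    /* guest-physical address of the CMD slot      */
    uint64_t resp_slot_dma;   /* guest-physical address of the RESPONSE slot */
    uint64_t cq_ring_dma;     /* device CQ notification ring                 */
};

#define REG_REQ 0x10          /* assumed offsets, see the previous sketch */
#define REG_ERR 0x14

static int pvrdma_exec_cmd(volatile uint32_t *regs,   /* mapped BAR 1 */
                           void *cmd_slot, const void *resp_slot,
                           const void *req, size_t req_len,
                           void *resp, size_t resp_len)
{
    memcpy(cmd_slot, req, req_len);        /* 1. place the request in the CMD slot */
    regs[REG_REQ / 4] = 0;                 /* 2. kick the device                   */
    int status = (int)regs[REG_ERR / 4];   /* 3. device reports status via ERR     */
    if (status == 0)
        memcpy(resp, resp_slot, resp_len); /* 4. read back the RESPONSE slot       */
    return status;
}
```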

PCI registers – BAR 2
BAR 2: UAR   QP_NUM | SEND|RECV || CQ_NUM | ARM|POLL
- QP operations
  - The guest driver writes the QP number and an op mask to the QP offset (0)
  - The write operation completes only after the operation has been sent to the backend device
- CQ operations
  - The guest driver writes the CQ number and an op mask to the CQ offset (4)
  - The write operation completes only after the command has been sent to the backend device
- Only one command at a time
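A minimal sketch of the corresponding doorbell writes, assuming the UAR offsets above; the recv bit (31) matches the post-receive flow later in the deck, while the other op-mask bit positions are assumptions.

```c
#include <stdint.h>

/* Illustrative doorbell writes to BAR 2 (UAR). Bit positions other than
 * the recv bit are assumptions. */
#define UAR_QP_OFFSET  0u
#define UAR_CQ_OFFSET  4u
#define UAR_QP_SEND    (1u << 30)   /* assumed */
#define UAR_QP_RECV    (1u << 31)
#define UAR_CQ_ARM     (1u << 30)   /* assumed */
#define UAR_CQ_POLL    (1u << 31)   /* assumed */

static inline void uar_ring_qp_recv(volatile uint32_t *uar, uint32_t qpn)
{
    /* The MMIO write completes only after the device has forwarded the
     * operation to the backend; only one command is in flight at a time. */
    uar[UAR_QP_OFFSET / 4] = qpn | UAR_QP_RECV;
}

static inline void uar_arm_cq(volatile uint32_t *uar, uint32_t cqn)
{
    uar[UAR_CQ_OFFSET / 4] = cqn | UAR_CQ_ARM;
}
```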

Resource manager
- All resources are "virtualized" (1:1 mapping with the backend device): PDs, QPs, CQs, ...
- The pvrdma device shares the resource memory allocated by the guest driver by means of a "shared page directory"
[Diagram: in the VM, the driver's CQ and QP send/recv rings are each described by a page directory; the page-directory info is passed to QEMU in the Create QP/CQ commands over the CMD channel (DSR), QEMU maps the same guest memory behind its virtual CQ/QP objects, and the backend HW CQ/QP is created against them.]
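A sketch of how the device side might map a guest ring described by such a page directory; gpa_to_hva() is a hypothetical helper standing in for QEMU's DMA-mapping API, and the structures are illustrative only.

```c
#include <stdint.h>
#include <stdlib.h>

#define RING_PAGE_SIZE 4096

struct ring_mapping {
    void   **pages;   /* host-virtual mappings of the guest ring pages */
    uint32_t npages;
};

/* hypothetical: translate a guest-physical address to a host VA */
extern void *gpa_to_hva(uint64_t gpa, size_t len);

static int map_page_directory(struct ring_mapping *ring,
                              uint64_t pdir_gpa, uint32_t npages)
{
    /* The page directory is a guest page holding an array of
     * guest-physical addresses, one per ring page. */
    uint64_t *pdir = gpa_to_hva(pdir_gpa, npages * sizeof(uint64_t));
    if (!pdir)
        return -1;

    ring->npages = npages;
    ring->pages  = calloc(npages, sizeof(void *));
    if (!ring->pages)
        return -1;

    for (uint32_t i = 0; i < npages; i++) {
        ring->pages[i] = gpa_to_hva(pdir[i], RING_PAGE_SIZE);
        if (!ring->pages[i])
            return -1;
    }
    return 0;
}
```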

Flows – Create CQ
Guest driver:
- Allocates pages for the CQ ring
- Creates a page directory to hold the CQ ring's pages
- Initializes the CQ ring
- Initializes the 'Create CQ' command object (cqe, pdir, etc.)
- Copies the command to the 'command' address (DSR)
- Writes 0 into the REQ register
Device:
- Reads the request object from the 'command' address
- Allocates a CQ object and initializes the CQ ring based on the pdir
- Creates the backend (HW) CQ
- Writes the operation status to the ERR register
- Posts a command interrupt to the guest
Guest driver:
- Reads the HW response code from the ERR register
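A device-side sketch of handling 'Create CQ', assuming the libibverbs backend: the virtual CQ wraps a backend CQ, and map_guest_ring() is a hypothetical helper that maps the guest pages listed in the page directory.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

extern void *map_guest_ring(uint64_t pdir_gpa, uint32_t npages); /* hypothetical */

struct virt_cq {
    struct ibv_cq *backend_cq;   /* CQ on the backend (HW or soft-RoCE) device */
    void          *guest_ring;   /* host mapping of the guest CQ ring          */
};

static int handle_create_cq(struct ibv_context *backend_ctx,
                            struct ibv_comp_channel *comp_chan,
                            struct virt_cq *vcq,
                            uint64_t pdir_gpa, uint32_t npages, int cqe)
{
    /* 1. Map the guest CQ ring described by the page directory. */
    vcq->guest_ring = map_guest_ring(pdir_gpa, npages);
    if (!vcq->guest_ring)
        return -1;

    /* 2. Create the backend CQ; its events arrive on comp_chan and are
     *    handled by the completion thread (see "Process completion"). */
    vcq->backend_cq = ibv_create_cq(backend_ctx, cqe, vcq, comp_chan, 0);
    if (!vcq->backend_cq)
        return -1;

    /* 3. The status is then written to the ERR register and a command
     *    interrupt is posted to the guest (not shown). */
    return 0;
}
```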

Flows – Create QP
Guest driver:
- Allocates pages for the send and receive rings
- Creates a page directory to hold the rings' pages
- Initializes the 'Create QP' command object (max_send_wr, send_cq_handle, recv_cq_handle, pdir, etc.)
- Copies the object to the 'command' address
- Writes 0 into the REQ register
Device:
- Reads the request object from the 'command' address
- Allocates a QP object and initializes the send and recv rings (and their state) based on the pdir
- Creates the backend QP
- Writes the operation status to the ERR register
- Posts a command interrupt to the guest
Guest driver:
- Reads the HW response code from the ERR register
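A sketch of the backend-QP creation step through libibverbs; the choice of RC transport and SGE limits here are assumptions for illustration.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

static struct ibv_qp *create_backend_qp(struct ibv_pd *pd,
                                        struct ibv_cq *send_cq,
                                        struct ibv_cq *recv_cq,
                                        uint32_t max_send_wr,
                                        uint32_t max_recv_wr)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = send_cq,   /* backend CQ behind send_cq_handle */
        .recv_cq = recv_cq,   /* backend CQ behind recv_cq_handle */
        .qp_type = IBV_QPT_RC,
        .cap = {
            .max_send_wr  = max_send_wr,
            .max_recv_wr  = max_recv_wr,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    /* Returns NULL on failure; the outcome is reported to the guest via
     * the ERR register, followed by a command interrupt. */
    return ibv_create_qp(pd, &attr);
}
```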

Flows – Post receive
Guest driver:
- Initializes a wqe and places it on the recv ring
- Writes qpn|qp_recv_bit (bit 31) to the QP offset in the UAR
Device:
- Extracts the qpn from the UAR write
- Walks through the ring and does the following for each wqe:
  - Prepares a backend CQE context to be used when the completion arrives from the backend (wr_id, op_code, emu_cq_num)
  - For each sge, prepares a backend sge; maps the DMA address to a QEMU virtual address
  - Calls the backend's post_recv
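A sketch of turning one guest recv wqe into a backend ibv_post_recv(); guest_dma_to_hva(), the guest sge layout, and the SGE limit are assumptions.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* hypothetical: translate a guest DMA address to a QEMU virtual address */
extern void *guest_dma_to_hva(uint64_t dma_addr, uint32_t len);

struct guest_sge { uint64_t addr; uint32_t length; uint32_t lkey; };

#define MAX_SGE 4

static int forward_post_recv(struct ibv_qp *backend_qp, uint64_t wr_id,
                             const struct guest_sge *gsge, int num_sge,
                             uint32_t backend_lkey)
{
    struct ibv_sge sge[MAX_SGE];
    struct ibv_recv_wr wr = { .wr_id = wr_id, .sg_list = sge, .num_sge = num_sge };
    struct ibv_recv_wr *bad_wr = NULL;

    if (num_sge > MAX_SGE)
        return -1;

    for (int i = 0; i < num_sge; i++) {
        /* Map the guest DMA address to a QEMU virtual address. */
        sge[i].addr   = (uintptr_t)guest_dma_to_hva(gsge[i].addr, gsge[i].length);
        sge[i].length = gsge[i].length;
        sge[i].lkey   = backend_lkey;   /* lkey of the backend memory region */
    }
    return ibv_post_recv(backend_qp, &wr, &bad_wr);
}
```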

Flows – Process completion
- A dedicated thread is used to process backend events
- At initialization it attaches to the device and creates a communication channel
Thread main loop:
- Polls for a completion
- Unmaps the sge's virtual addresses
- Extracts emu_cq_num, wr_id and op_code from the context
- Writes the CQE to the CQ ring
- Writes the CQ number to the device CQ ring
- Sends a completion interrupt to the guest
- Deallocates the context
- Acks the event (ibv_ack_cq_events)
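A sketch of that completion thread: block on the completion channel, poll the backend CQ, and relay each completion to the guest. struct wqe_ctx, write_guest_cqe() and notify_guest() are hypothetical names standing in for the device's own helpers.

```c
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

struct wqe_ctx {                 /* context prepared when the wqe was posted */
    uint32_t emu_cq_num;
    uint64_t wr_id;
};

extern void write_guest_cqe(uint32_t emu_cq_num, uint64_t wr_id,
                            int opcode, int status);   /* hypothetical */
extern void notify_guest(uint32_t emu_cq_num);         /* hypothetical */

static void *comp_thread(void *arg)
{
    struct ibv_comp_channel *chan = arg;

    for (;;) {
        struct ibv_cq *cq;
        void *cq_ctx;
        struct ibv_wc wc;

        /* Block until the backend signals a completion event. */
        if (ibv_get_cq_event(chan, &cq, &cq_ctx))
            break;
        ibv_req_notify_cq(cq, 0);                /* re-arm for the next event */

        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            struct wqe_ctx *ctx = (void *)(uintptr_t)wc.wr_id;

            write_guest_cqe(ctx->emu_cq_num, ctx->wr_id, wc.opcode, wc.status);
            notify_guest(ctx->emu_cq_num);       /* completion interrupt */
            free(ctx);                           /* deallocate the context */
        }
        ibv_ack_cq_events(cq, 1);
    }
    return NULL;
}
```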

Open issues
- RDMA CM support
- Guest memory registration
- Migration support
- Memory overcommit support
- Providing GIDs to guests

Open issues (cont) – RDMA CM support
- Emulate QP1 at the QEMU level
- Modify RDMA CM to communicate with QEMU instances:
  - Pass incoming MADs to QEMU
  - Allow QEMU to forward MADs
- What would be the preferred mechanism?

Open issues (cont) – Guest memory registration
- Guest userspace: the register-memory-region verb accepts only a single VA address, specifying both the on-wire key and a memory range
- Guest kernels: in-kernel code may need to register all memory (ranges that are not used are not accessed → the kernel is trusted)
- Memory hotplug changes the present ranges
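For context on the userspace point, the verb in question is ibv_reg_mr(), which registers exactly one contiguous VA range and yields one lkey/rkey pair; the access flags below are just an example.

```c
#include <stddef.h>
#include <infiniband/verbs.h>

/* ibv_reg_mr() covers a single contiguous virtual-address range and
 * returns one MR (lkey/rkey); this is the constraint discussed above. */
static struct ibv_mr *register_guest_range(struct ibv_pd *pd,
                                           void *addr, size_t length)
{
    return ibv_reg_mr(pd, addr, length,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```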

Open issues (cont) – Migration support
- Moving QP numbers: ensure the same QP numbers on the destination
- Memory tracking
  - Migration requires detecting HW accesses to pages
  - Maybe we can avoid the need for it
- Moving QP state
  - How should a QP behave during migration, some "busy state"?
- Network update / packet-loss recovery
  - Can the system recover from some packet loss?

Open issues (cont) – Memory overcommit support
- Do we really need "on-demand" pages? Everything goes through QEMU, which is a user-level application on the host that forwards the requests to the host kernel.
- Did we miss anything?

Open issues (cont) – Providing GIDs to guests (multi-GID)
- Instead of coming up with extra IPv6 addresses, it would be better to set the host GID based on the guest's IPv6 address
- There is no API for doing that from user level

Future plans
- Submit VMware's PVRDMA device implementation to QEMU (and later add RoCE v2 support)
- Improve performance
- Move to a virtio-based RDMA device
  - We would like to agree on a clean design:
    - A virtio queue for the command channel
    - A virtio queue for each Recv/Send/CQ buffer
    - Anything else?
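Purely as a discussion aid for that design question, a command-queue descriptor might look roughly like the hypothetical sketch below; nothing here is settled, and the real layout is exactly what needs to be agreed on.

```c
#include <stdint.h>

/* Hypothetical virtio-rdma command descriptor; illustrative only. */
struct virtio_rdma_cmd {
    uint32_t opcode;   /* e.g. CREATE_CQ, CREATE_QP, REG_MR, ...         */
    uint32_t handle;   /* resource handle the command applies to         */
    uint64_t addr;     /* guest address of the command-specific payload  */
    uint32_t len;      /* payload length                                 */
    uint32_t status;   /* written back by the device on completion       */
};
```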