VIRTIO 1.1 FOR HARDWARE Rev2.0 Kully Dhanoa, Intel
Note: Proposals made in this presentation are evolving. This presentation contains proposals as of 8th March 2017.
Objective I
- VirtIO 1.1 spec to enable efficient implementation for both software AND hardware solutions
- Where the optimum software and hardware solutions differ:
  - Option 1: Suffer non-optimum performance (assuming it is not significant) for the sake of a simpler spec
  - Option 2: Allow different implementations via a Capability feature to ensure optimum software AND hardware implementations
- Understand software and hardware implementation concerns
Objective II
- NOTE: Proposals are biased towards:
  - FPGA implementation
  - VirtIO_net device
  - PCIe transport mechanism
- Collectively we must consider the implications on:
  - Software running on the guest
  - Pure software implementations
  - Other VirtIO device types
  - Other transport mechanisms?
- THEN decide whether to:
  - Adopt proposals
  - Modify proposals
  - Drop proposals
Agenda
- Guest signaling available descriptors
- Device signaling used descriptors
- Out-of-order processing
- Indirect chaining
- Rx: fixed buffer sizes
- Data/descriptor alignment boundaries
Guest signaling available descriptors
Overview
- Current proposal:
  - Each descriptor has a 1-bit flag, DESC_HW
  - Guest creates descriptors and then sets the DESC_HW flag
  - Host/FPGA reads descriptors and can use them if DESC_HW is set
  - After finishing with a descriptor, the Host/FPGA clears its DESC_HW flag
- New proposal:
  - Instead of the DESC_HW flag, each VirtIO queue has a single tail pointer
  - OK to have one MMIO address for all tail pointers: {queue no, tail index, <optional values: total pkts, total payload bytes>} (sketched below)
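A rough sketch of what such a combined tail-pointer notification might look like; the record layout, field names, and field widths are assumptions for illustration, not values from the spec or the proposal.

```c
#include <stdint.h>

/* Hypothetical layout of a single MMIO doorbell word written by the guest.
 * Field names and widths are illustrative assumptions, not spec values. */
struct vq_tail_doorbell {
    uint16_t queue_id;        /* which VirtIO queue the update applies to  */
    uint16_t tail_index;      /* new tail index for that queue             */
    uint16_t total_pkts;      /* optional hint: packets made available     */
    uint16_t total_bytes_lo;  /* optional hint: payload bytes (truncated)  */
};

/* One 64-bit MMIO write publishes a whole batch of newly created
 * descriptors, instead of a per-descriptor DESC_HW flag update. */
static inline void notify_device(volatile uint64_t *doorbell,
                                 struct vq_tail_doorbell db)
{
    uint64_t v = (uint64_t)db.queue_id
               | ((uint64_t)db.tail_index     << 16)
               | ((uint64_t)db.total_pkts     << 32)
               | ((uint64_t)db.total_bytes_lo << 48);
    *doorbell = v;  /* single posted PCIe write to the shared MMIO address */
}
```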
Cons of current proposal
[Diagram: the guest creates Desc #0 and Desc #1 and flushes a cache line per descriptor. The FPGA polls by reading descriptors (~1us per read) and may receive descriptors whose DESC_HW flag is not yet set; these invalid descriptors are discarded and read again later before the valid ones can be processed. Result: wasted PCIe bandwidth on both the repeated descriptor reads and the per-descriptor cache line flushes.]
New proposal
- No DESC_HW flags
- Every VirtIO queue has its own:
  - Head pointer (lives in FPGA): not used by the guest; RW for the FPGA
  - Tail pointer (lives in FPGA): W for the guest; RO for the FPGA
- Number of valid descriptors for the FPGA = tail - head
- Space for the guest to add descriptors = head - tail (see the sketch below for wrap-around handling)
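A minimal sketch of the head/tail arithmetic, assuming free-running indices masked by a power-of-two queue size; the convention of keeping one slot unused (so that full and empty states are distinguishable) is an added assumption, the slide itself only states tail - head and head - tail.

```c
#include <stdint.h>

#define QUEUE_SIZE 256u   /* assumed power-of-two ring size for this sketch */

/* Descriptors the device may consume: tail - head (mod ring size). */
static inline uint32_t descs_valid_for_device(uint32_t head, uint32_t tail)
{
    return (tail - head) & (QUEUE_SIZE - 1);
}

/* Free slots the guest may still fill: head - tail - 1 (mod ring size),
 * keeping one slot unused so a full ring is distinguishable from an empty one. */
static inline uint32_t descs_free_for_guest(uint32_t head, uint32_t tail)
{
    return (head - tail - 1) & (QUEUE_SIZE - 1);
}
```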
Pros of new proposal
[Diagram: the guest creates Desc #0-#3, flushes a single cache line covering all four, and updates the tail pointer. The FPGA's tail-pointer copy now shows 4 descriptors available, so it reads all 4 in one request (~1us), processes the 4 valid descriptors, and updates the head pointer. Result: one cache line flush per 4 valid descriptors and no wasted PCIe bandwidth.]
Device signaling used Descriptors
Overview I
- Current proposal:
  - FPGA clears each descriptor's DESC_HW flag (1 bit) after it has finished with the descriptor
- New proposal:
  - FPGA does not need to clear the DESC_HW flag for every descriptor
  - Guest controls which descriptors need to have their DESC_HW cleared
  - Descriptor has an extra 1-bit field, WB (Write-Back):
    - WB=1 => FPGA must write back this descriptor after use (at the minimum, clear the DESC_HW flag)
    - WB=0 => FPGA need not write back the descriptor
  - Saves PCIe bandwidth: in many scenarios the descriptor data need not be written back, e.g. Tx, and Rx for network devices where packet metadata is prepended to the packet data (see the descriptor sketch below)
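A sketch of what a descriptor carrying the proposed flags could look like; the 16-byte layout, field names, and bit positions are illustrative assumptions based on the text above, not the draft spec.

```c
#include <stdint.h>

/* Illustrative descriptor layout; names and bit positions are assumptions. */
#define DESC_HW  (1u << 0)   /* set by guest, cleared by device when WB=1  */
#define DESC_WB  (1u << 1)   /* guest asks device to write the desc back   */

struct vring_desc_hw {
    uint64_t addr;    /* guest-physical address of the buffer  */
    uint32_t len;     /* buffer length in bytes                */
    uint16_t id;      /* buffer id, useful for out-of-order    */
    uint16_t flags;   /* DESC_HW | DESC_WB | ...               */
};

/* Guest: request a write-back only on the last descriptor of a batch, so the
 * device clears DESC_HW once per batch rather than once per descriptor. */
static inline void mark_batch(struct vring_desc_hw *d, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        d[i].flags = DESC_HW | (i == n - 1 ? DESC_WB : 0);
}
```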
Device signaling used descriptors
[Diagram: at T0 the guest has posted Desc #0-#7 with DESC_HW set; only Desc #3 has WB=1, and the guest is polling. At T1 the FPGA has finished with the first 4 descriptors and writes back Desc #3 ONLY, clearing its DESC_HW flag. At T3 the guest detects that Desc #3's DESC_HW flag is cleared, which indicates that Desc #3 and ALL previous descriptors up to the last descriptor with WB=1 are available to the guest again.] A guest-side sketch of this scan follows.
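A guest-side sketch of the detection rule illustrated above, under the assumption that the device completes write-back points in order; the flag bit positions match the hypothetical descriptor sketch earlier.

```c
#include <stdint.h>

/* Same assumed flag bit positions as the earlier descriptor sketch. */
#define DESC_HW  (1u << 0)
#define DESC_WB  (1u << 1)

/* Scan the descriptor flags from the last known completion point towards the
 * tail.  A completion is recognised at a descriptor with WB=1 whose DESC_HW
 * bit the device has cleared; that descriptor and all earlier ones are then
 * reusable.  Descriptors with WB=0 are never written back, so they are only
 * reclaimed once a later write-back point completes. */
static unsigned poll_used(const uint16_t *flags, unsigned ring_size,
                          unsigned last_seen, unsigned tail)
{
    unsigned done = last_seen;                  /* committed completion index */
    for (unsigned i = last_seen; i != tail; i = (i + 1) % ring_size) {
        uint16_t f = flags[i];
        if (!(f & DESC_WB))
            continue;                           /* no write-back requested    */
        if (f & DESC_HW)
            break;                              /* device not finished yet    */
        done = (i + 1) % ring_size;             /* this desc and all before it
                                                   are available again        */
    }
    return done;
}
```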
VirtIO devices that must write back descriptors
- Still supported:
  - Device must write back descriptors whose WB flag is set
  - Device may optionally write back descriptors whose WB flag is clear
- Scenarios where the device may have to write back every descriptor:
  - Rx where metadata is contained within the descriptor and not in the packet buffer
  - Out-of-order processing
  - In these cases the guest should set the WB flag for all descriptors
Out-of-order processing
Overview
- Current proposal:
  - FPGA reads descriptors from the queue but can process them in any order
  - Once finished with a descriptor, it is written back to the ring with the DESC_HW flag cleared
  - It need not be written back to the same location it was read from
  - A descriptor can only be written back to a location owned by the FPGA (i.e. DESC_HW flag set) and from which the FPGA has already cached the descriptor
- No change proposed:
  - Hardware implementations are unlikely to process descriptors out of order within a particular queue
  - The feature will be negotiable upon startup
  - However, nothing prevents hardware implementations from processing out of order (see the sketch below)
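An illustrative device-side sketch of the write-back constraint described above: a completed descriptor may be returned to any slot the device owns and has already cached, not necessarily the slot it was fetched from. The bookkeeping structure and helper are hypothetical.

```c
#include <stdbool.h>

/* Per-slot state tracked by the device for one ring (illustrative only). */
struct slot_state {
    bool owned;    /* DESC_HW was set when the slot was fetched          */
    bool cached;   /* device holds a local copy of the original desc     */
    bool returned; /* a completed descriptor was already written back here */
};

/* Pick any eligible slot for writing back a completed descriptor; the slot
 * must be owned by the device and already cached.  Returns -1 if none. */
static int pick_writeback_slot(const struct slot_state *s, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        if (s[i].owned && s[i].cached && !s[i].returned)
            return (int)i;
    return -1;
}
```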
Out-of-order processing
- Out-of-order processing allows descriptors to be written back to the Descriptor Table in any order, with the DESC_HW flag cleared for used descriptors
[Diagram: at T0 the guest has posted Desc #0-#3; at T1 the descriptors are copied into the FPGA; at T2 two descriptors have been processed; at T3 the completed descriptors are written back into slots the FPGA owns, not necessarily the slots they were read from, with their DESC_HW flags cleared.]
Indirect Chaining
Overview
- Current proposal:
  - Indirect chaining is a negotiable feature?
- Hardware:
  - Very unlikely that hardware implementations would support this, due to the extra latency of fetching the actual descriptors
  - Is this acceptable? i.e. what is the reason for indirect chaining? Would some guests not work unless it is supported?
Rx Fixed Buffer Sizes
Overview
- Current proposal:
  - Guest is free to choose whatever buffer sizes it wishes for Tx and Rx buffers
  - Theoretically, within a ring a guest could have different buffer sizes
  - Is this really done for Rx buffers? If so, what is the advantage?
  - Tx buffers: I realise some OSes create separate small buffers to hold network packet headers, with larger buffers for packet data
- New proposal:
  - Guest negotiates with the device the size of an Rx buffer for a ring
  - Each descriptor in that ring will have a buffer of the same size
  - Different rings can have different-sized buffers
  - Device can stipulate minimum (and max?) Rx buffer sizes
  - Device can stipulate an Rx buffer size multiple, e.g. Rx buffer size must be a multiple of 32B (see the sketch below)
  - Is this acceptable?
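A sketch of how the proposed negotiation constraints might be checked by the guest; the structure, field names, and the idea of bundling min/max/multiple together are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constraints the device could advertise for Rx buffers on a ring.
 * Field names and semantics are illustrative assumptions. */
struct rx_buf_constraints {
    uint32_t min_size;    /* device refuses buffers below this size       */
    uint32_t max_size;    /* optional upper bound (0 = no limit)          */
    uint32_t multiple;    /* e.g. 32: size must be a multiple of 32B      */
};

/* Check whether the guest's chosen fixed Rx buffer size is acceptable. */
static bool rx_buf_size_ok(uint32_t size, const struct rx_buf_constraints *c)
{
    if (size < c->min_size)                return false;
    if (c->max_size && size > c->max_size) return false;
    if (c->multiple && size % c->multiple) return false;
    return true;
}
```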
Rx fixed buffer size
[Diagram: an FPGA with a finite descriptor buffer pointing at Rx data buffers, comparing variable-sized against fixed-size buffers.]
- Variable buffer sizes: the total data buffer size supported by the finite descriptor buffer is UNKNOWN, so it is difficult to predict the data throughput that can be maintained; assuming the smallest buffer size of 12B would lead to an oversized descriptor buffer in most scenarios
- Fixed buffer sizes: the total data buffer size supported by the descriptor buffer is KNOWN = number of descriptors * Rx buffer size; data throughput = func(min{min pkt size, Rx buffer size})
Fixed Rx buffer size vs minimum Rx buffer size
- From the previous slide: it appears that just increasing the minimum Rx buffer size from 12B to some sensible value (e.g. 256B) would be sufficient?
  - YES, this would help immensely, BUT the hardware design is easier and more optimal if the Rx buffer size is fixed
- Assuming a hardware implementation that fetches Rx descriptors on demand:
  - As data enters the device, descriptors are fetched as needed
  - If the device knows the Rx buffer size, it can precalculate how many descriptors the packet requires and fetch them efficiently as a batch (see the sketch below)
  - No local descriptor buffer space is wasted on descriptors that will not be needed
- Note, for hardware implementations prefetching Rx descriptors: specifying just a sensible minimum Rx buffer size may be acceptable; however, a fixed Rx buffer size would allow prefetching exactly the right number of descriptors to maintain throughput
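The batching calculation described above reduces to a ceiling division once the per-ring Rx buffer size is fixed; a minimal sketch:

```c
#include <stdint.h>

/* With a fixed per-ring Rx buffer size, the device can compute, as soon as a
 * packet's length is known, exactly how many descriptors to fetch in one batch. */
static inline uint32_t rx_descs_needed(uint32_t pkt_len, uint32_t rx_buf_size)
{
    return (pkt_len + rx_buf_size - 1) / rx_buf_size;   /* ceiling division */
}
/* Example: a 1514B packet with fixed 256B buffers needs 6 descriptors. */
```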
Data/Descriptor Alignment Boundaries
Overview
- Current proposal:
  - Guest is free to choose descriptor alignment down to a minimum of an x-byte boundary?
  - Guest is free to choose data buffer alignment to any byte boundary?
- New proposal:
  - Descriptors are aligned on a 16B boundary
  - Guest negotiates with the device the required data buffer alignment
    - Could be from a 4B to 128B boundary?
  - Wouldn't s/w benefit from cacheline-aligned buffer start addresses and an integer number of descriptors per cacheline? (see the sketch below)
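A sketch of the alignment rules as simple helpers; the 16B descriptor alignment comes from the proposal above, while the helper names and the 64B cache-line example are assumptions.

```c
#include <stdint.h>

#define DESC_ALIGN 16u   /* proposed descriptor alignment */

/* Round an address up to the next multiple of align (align must be a power of 2). */
static inline uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Example: place a data buffer on the negotiated boundary (say 64B, one cache
 * line), so an integral number of 16B descriptors fits per cache line and
 * buffer starts do not straddle FPGA word boundaries. */
static inline uint64_t place_rx_buffer(uint64_t base, uint64_t negotiated_align)
{
    return align_up(base, negotiated_align);
}
```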
PCIe packets
[Diagram: a single 4B-aligned access from the guest fits cleanly into the 32-bit data words of a PCIe TLP, whereas a 4B-unaligned access spills across an extra data word.]
- 4B-unaligned accesses add inefficiency to PCIe packets (TLPs)
Internal FPGA data alignment: implementation dependent
[Diagram: a single 32B-aligned access from the guest fills one 256-bit (32B) FPGA data word after the PCIe core, whereas a 32B-unaligned access straddles two data words.]
- To achieve high throughput, FPGA data widths can be 256 bits (or 512 bits, or even higher in future); this also depends on clock frequency
- Chained descriptors with buffers starting on unaligned boundaries exacerbate the problem
- Accesses may have to be aligned to FPGA word boundaries for ease of design