Xen Network I/O Performance Analysis and Opportunities for Improvement

Xen Network I/O Performance Analysis and Opportunities for Improvement J. Renato Santos G. (John) Janakiraman Yoshio Turner HP Labs Xen Summit April 17-18, 2007

Motivation
Network I/O has high CPU cost. CPU cost for a TCP connection at 1 Gbps (xen-unstable as of 03/16/2007; PV Linux guest; x86 32-bit):
- TX: 350% of the cost of native Linux
- RX: 310% of the cost of native Linux

Outline
- Performance analysis for the network I/O RX path (netfront/netback)
- Network I/O RX optimizations
- Network I/O TX optimization

Performance Analysis for the Network I/O RX Path

Experimental Setup
- Machines: HP ProLiant DL580 (client and server); P4 Xeon 2.8 GHz, 4 CPUs (MT disabled), 64 GB, 512 KB L2, 2 MB L3
- NIC: Intel E1000 (1 Gbps)
- Network configuration: single switch connecting client and server
- Server configuration: Xen unstable (changeset 14415, March 16, 2007), default xen0/xenU configs; single guest (512 MB), dom0 also with 512 MB
- Benchmark: simple UDP microbenchmark (1 Gbps, 1500-byte packets)

CPU Profile for the Network RX Path
- The cost of the data copy is significant in both Linux and Xen
- Xen incurs the cost of an additional data copy
- The Xen guest kernel alone uses more CPU than native Linux
- Most of the cost of Xen code is in dom0

Xen Code Cost
The major Xen overhead is the grant table and domain page map/unmap.
The copy grant involves several expensive atomic operations (lock instruction prefix):
- an atomic cmpxchg operation to update the status field (grant in use)
- incrementing/decrementing grant usage counters (multiple spinlock operations)
- incrementing/decrementing source and destination page reference counts (get_page()/put_page())

Dom0 Kernel Cost
- Bridging/networking is a large component of the dom0 cost (as reported at Xen Summit 2006); it can be reduced if the netfilter bridge config option is disabled
- New Xen code: netback, hypercalls, swiotlb
- Higher interrupt overhead in Xen: extra code in evtchn.c
- Additional high-cost functions in Xen (accounted under "other"): spin_unlock_irqrestore(), spin_trylock()

Bridge Netfilter Cost
- What do we need to do to disable the bridge netfilter by default?
- Should we add a netfilter hook in netback?

Overhead in the Guest
- Netfront is five times more expensive than the e1000 driver in native Linux
- Memory operations (mm) are twice as expensive in Xen (?)
- Grant table: high cost of the atomic cmpxchg operation used to revoke grant access
- "Other": spin_unlock_irqrestore(), spin_trylock() (same as in dom0)

Source of Netfront Cost
- Netback copies packet data into netfront page fragments
- Netfront then copies the first 200 bytes of each packet from the fragment into the main socket buffer data area
- Most of the netfront cost is due to this extra data copy

Opportunities for Improvement on the RX Path

Reduce RX Header Copy Size
- There is no need to have all headers in the main SKB data area
- Copy only the Ethernet header (14 bytes)
- The network stack copies more data as needed

Move the Grant Data Copy to the Guest CPU
- Dom0 grants access to the data and the guest copies it using a copy grant
- The cost of the second copy to the user buffer is reduced, since the data is already in the guest's CPU cache (assuming the cache is not evicted because the user process delays its read)
- Additional benefit: improves dom0 (driver domain) scalability, as more work is done on the guest side
- Drawback: the data copy is more expensive in the guest (alignment problem)

Grant Copy Alignment Problem
[Diagram: grant copy from the netback SKB into the netfront fragment page, showing the copy in dom0 and the copy in the guest; the packet starts at a half-word (2-byte) offset.]
- The copy is expensive when the destination start is not at a word boundary
- Fix: also copy the 2 prefix bytes, so source and destination are then aligned

Fixing Grant Copy Alignment
- With the fix, the grant copy in the guest becomes more efficient than in current Xen
- For the grant copy in dom0, the destination is word-aligned but the source is not
- This can also be improved by copying additional prefix data: either 2 or (2 + 16) bytes

Fixing Alignment in Current Xen
- Source alignment reduces the copy cost
- Placing source and destination at the same buffer offset performs better still; the reason (?) may be that matching offsets behave better in the cache
- The copy in dom0 is still more expensive than the copy in the guest, due to different cache behavior

Copy in the Guest Has Better Cache Locality
[Charts: L2 cache misses for the grant copy in the guest vs. the grant copy in dom0.]
- The dom0 copy has more L2 cache misses than the guest copy, i.e. lower cache locality
- The guest posts multiple pages on the I/O ring, and all pages in the ring must be used before the same page can be reused
- For the guest copy, pages are allocated on demand and reused more often, improving cache locality

Possible Grant Optimizations
Define a new, simple copy grant:
- Allow only one copy operation at a time, so there is no need to keep grant usage counters (removes locking)
- Avoid the cost of the atomic cmpxchg operations by separating the fields used for enabling the grant and for its usage status
- Avoid incrementing/decrementing page reference counters by using an RCU scheme for page deallocation (lazy deallocation)

Potential Savings from Grant Modifications
- These results are optimistic: the grant modifications still need to be implemented
- The results are based on eliminating the current operations from the profile

Coalescing Netfront RX Interrupts
- The NIC (e1000) already coalesces hardware interrupts (~10 packets/interrupt)
- Batching packets in netfront can provide additional benefit: about 10% for 32-packet batches, but it adds extra latency
- A dynamic coalescing scheme should be beneficial

Coalescing Effect on Xen Cost (Xen in Guest Context)
- Except for the grant and domain page map/unmap, all other Xen costs are amortized by larger batches
- This is an additional reason to optimize the grant mechanism

Combining All RX Optimizations
- The CPU cost of network I/O for RX can be significantly reduced: from ~250% to ~70% overhead compared to native Linux
- The largest improvement comes from moving the grant copy to the guest CPU

Optimization for the TX Path

Lazy Page Mapping on TX
- Dom0 only needs to access packet headers; there is no need to map the guest pages holding the packet payload, since the NIC accesses that memory directly through DMA
- Avoid mapping the guest page on packet TX:
  - Copy the packet headers through the I/O ring
  - A modified grant operation returns the machine address (for DMA) but does not map the page in dom0
  - Provide a page fault handler for the cases where dom0 does need to access the payload (packets destined to dom0/domU; netfilter rules)

Benefit of Lazy TX Page Mapping
- Workloads: 1 Gbps TCP TX (64 KB messages) and 1 Gbps TCP RX
- Performance improvement from the TX optimization: ~10% for large TX; ~8% for TCP RX, due to the ACK traffic
- Some additional improvement may be possible with the grant optimizations

Questions?