Software Routers: NetSlice Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking October 15,

Slides:



Advertisements
Similar presentations
NetSlices: Scalable Multi-Core Packet Processing in User-Space Tudor Marian, Ki Suh Lee, Hakim Weatherspoon Cornell University Presented by Ki Suh Lee.
Advertisements

Chapter 8-1 : Multiple Processor Systems Multiple Processor Systems Multiple Processor Systems Multiprocessor Hardware Multiprocessor Hardware UMA Multiprocessors.
Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.
Design and Implementation of a Consolidated Middlebox Architecture 1 Vyas SekarSylvia RatnasamyMichael ReiterNorbert Egi Guangyu Shi.
Multiple Processor Systems
Chapter 8 Hardware Conventional Computer Hardware Architecture.
1. Overview  Introduction  Motivations  Multikernel Model  Implementation – The Barrelfish  Performance Testing  Conclusion 2.
Spring 2002CS 4611 Router Construction Outline Switched Fabrics IP Routers Tag Switching.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors Bugnion et al. Presented by: Ahmed Wafa.
Chapter Hardwired vs Microprogrammed Control Multithreading
Embedded Transport Acceleration Intel Xeon Processor as a Packet Processing Engine Abhishek Mitra Professor: Dr. Bhuyan.
Data Center Middleboxes Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking November 24,
Software Routers: NetMap Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking October 8, 2014.
Alternative Switching Technologies: Optical Circuit Switches Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance.
Data Center Networks and Basic Switching Technologies Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems.
ECE 526 – Network Processing Systems Design
Data Center Virtualization: Open vSwitch Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking.
Netslice: Enabling Critical Network Infrastructure with Commodity Routers Prof. Hakim Weatherspoon, Cornell University Joint with Tudor Marian, Ki Suh.
Router Architectures An overview of router architectures.
Xen and the Art of Virtualization. Introduction  Challenges to build virtual machines Performance isolation  Scheduling priority  Memory demand  Network.
IWARP Ethernet Key to Driving Ethernet into the Future Brian Hausauer Chief Architect NetEffect, Inc.
Router Architectures An overview of router architectures.
Data Center Traffic and Measurements: Available Bandwidth Estimation Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance.
Data Center Virtualization: VirtualWire Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking.
Introduction to Symmetric Multiprocessors Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı
Sven Ubik, Petr Žejdl CESNET TNC2008, Brugges, 19 May 2008 Passive monitoring of 10 Gb/s lines with PC hardware.
Revisiting Network Interface Cards as First-Class Citizens Wu-chun Feng (Virginia Tech) Pavan Balaji (Argonne National Lab) Ajeet Singh (Virginia Tech)
Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.
Performance Tradeoffs for Static Allocation of Zero-Copy Buffers Pål Halvorsen, Espen Jorde, Karl-André Skevik, Vera Goebel, and Thomas Plagemann Institute.
Computer System Architectures Computer System Software
Hosting Virtual Networks on Commodity Hardware VINI Summer Camp.
COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS Aleksandar Ili´c, Leonel Sousa 2010 IEEE International Symposium on Parallel & Distributed.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
Data Center Virtualization: Xen and Xen-blanket
Task Scheduling for Highly Concurrent Analytical and Transactional Main-Memory Workloads Iraklis Psaroudakis (EPFL), Tobias Scheuer (SAP AG), Norman May.
LiNK: An Operating System Architecture for Network Processors Steve Muir, Jonathan Smith Princeton University, University of Pennsylvania
Alternative Switching Technologies: Wireless Datacenters Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems.
Boosting Event Building Performance Using Infiniband FDR for CMS Upgrade Andrew Forrest – CERN (PH/CMD) Technology and Instrumentation in Particle Physics.
Multi-core architectures. Single-core computer Single-core CPU chip.
MIDeA :A Multi-Parallel Instrusion Detection Architecture Author: Giorgos Vasiliadis, Michalis Polychronakis,Sotiris Ioannidis Publisher: CCS’11, October.
TILEmpower-Gx36 - Architecture overview & performance benchmarks – Presented by Younghyun Jo 2013/12/18.
On the processing time for detection of Skype traffic P.M. Santiago del Río, J. Ramos, J.L. García-Dorado, J. Aracil Universidad Autónoma de Madrid A.
A Measurement Based Memory Performance Evaluation of High Throughput Servers Garba Isa Yau Department of Computer Engineering King Fahd University of Petroleum.
Srihari Makineni & Ravi Iyer Communications Technology Lab
Parallel Programming on the SGI Origin2000 With thanks to Igor Zacharov / Benoit Marchand, SGI Taub Computer Center Technion Moshe Goldberg,
Increasing Web Server Throughput with Network Interface Data Caching October 9, 2002 Hyong-youb Kim, Vijay S. Pai, and Scott Rixner Rice Computer Architecture.
Next Generation Operating Systems Zeljko Susnjar, Cisco CTG June 2015.
An Architecture and Prototype Implementation for TCP/IP Hardware Support Mirko Benz Dresden University of Technology, Germany TERENA 2001.
CS 4396 Computer Networks Lab Router Architectures.
Intel Research & Development ETA: Experience with an IA processor as a Packet Processing Engine HP Labs Computer Systems Colloquium August 2003 Greg Regnier.
Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.
Operating System Issues in Multi-Processor Systems John Sung Hardware Engineer Compaq Computer Corporation.
Full and Para Virtualization
An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.
An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors Presented by: Pierre LaBorde, Jordan Deveroux, Imran Ali, Yazen Ghannam, Tzu-Wei.
Spring 2000CS 4611 Router Construction Outline Switched Fabrics IP Routers Extensible (Active) Routers.
UDI Technology Benefits Slide 1 Uniform Driver Interface UDI Technology Benefits.
Presented by: Xianghan Pei
Major OS Components CS 416: Operating Systems Design, Spring 2001 Department of Computer Science Rutgers University
Background Computer System Architectures Computer System Software.
Introduction Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency Job-level (process-level)
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
NFV Compute Acceleration APIs and Evaluation
Is Virtualization ready for End-to-End Application Performance?
The Multikernel: A New OS Architecture for Scalable Multicore Systems
Alternative system models
The Multikernel A new OS architecture for scalable multicore systems
Shadow: Scalable and Deterministic Network Experimentation
Presentation transcript:

Software Routers: NetSlice Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking October 15, 2014 Slides from Ki Suh Lee’s presentation at the ACM/IEEE Symposium on Architectures for Networking and Communication Systems (ANCS), October 2012.

Overview and Basics Data Center Networks – Basic switching technologies – Data Center Network Topologies (today and Monday) – Software Routers (eg. Click, Routebricks, NetMap, Netslice) – Alternative Switching Technologies – Data Center Transport Data Center Software Networking – Software Defined networking (overview, control plane, data plane, NetFGPA) – Data Center Traffic and Measurements – Virtualizing Networks – Middleboxes Advanced Topics Where are we in the semester?

Goals for Today NetSlices: Scalable Multi-Core Packet Processing in User-Space – T. Marian, K. S. Lee, and H. Weatherspoon. ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), October 2012, pages

Packet Processors Essential for evolving networks – Sophisticated functionality – Complex performance enhancement protocols

Packet Processors Essential for evolving networks – Sophisticated functionality – Complex performance enhancement protocols Challenges: High-performance and flexibility – 10GE and beyond – Tradeoffs

Software Packet Processors Low-level (kernel) vs. High-level (userspace) Parallelism in userspace: Four major difficulties – Overheads & Contention – Kernel network stack – Lack of control over hardware resources – Portability

Overheads & Contention Cache coherence Memory Wall Slow cores vs. Fast NICs Memory CPU Memory NIC

Kernel network stack & HW control Raw socket: all traffic from all NICs to user-space Too general, hence complex network stack Hardware and software are loosely coupled Applications have no control over resources Network Stack Application Raw socket Network Stack Network Stack Network Stack Network Stack Network Stack Network Stack Network Stack Network Stack Application Network Stack

Portability Hardware dependencies Kernel and device driver modifications – Zero-copy – Kernel bypass

Outline Difficulties in building packet processors NetSlice Evaluation Discussions Conclusion

NetSlice Give power to the application – Overheads & Contention – Lack of control over hardware resources Spatial partitioning exploiting NUMA architecture – Kernel network stack Streamlined path for packets – Portability No zero-copy, kernel & device driver modifications

NetSlice Spatial Partitioning Independent (parallel) execution contexts – Split each Network Interface Controller (NIC) One NIC queue per NIC per context – Group and split the CPU cores – Implicit resources (bus and memory bandwidth) Temporal partitioning (time-sharing) Spatial partitioning (exclusive-access)

NetSlice Spatial Partitioning Example 2x quad core Intel Xeon X5570 (Nehalem) – Two simultaneous hyperthreads – OS sees 16 CPUs – Non Uniform Memory Access (NUMA) QuickPath point-to-point interconnect – Shared L3 cache

Heavyweight Network Stack Streamlined Path for Packets Inefficient conventional network stack – One network stack “to rule them all” – Performs too many memory accesses – Pollutes cache, context switches, synchronization, system calls, blocking API

Portability No zero-copy – Tradeoffs between portability and performance – NetSlices achieves both No hardware dependency A run-time loadable kernel module

NetSlice API Expresses fine-grained hardware control Flexible: based on ioctl Easy: open, read, write, close 1: #include "netslice.h" 2: 3: structnetslice_rw_multi { 4: int flags; 5: } rw_multi; 6: 7: structnetslice_cpu_mask { 8: cpu_set_tk_peer, u_peer; 9: } mask; 10: 11: fd = open("/dev/netslice-1", O_RDWR); 12: 13: rw_multi.flags = MULTI_READ | MULTI_WRITE; 14: ioctl(fd, NETSLICE_RW_MULTI_SET, &rw_multi); 15: ioctl(fd, NETSLICE_CPUMASK_GET, &mask); 16: sched_setaffinity(getpid(), sizeof(cpu_set_t), 17: &mask.u_peer); 18 19: for (;;) { 20: ssize_tcnt, wcnt = 0; 21: if ((cnt = read(fd, iov, IOVS)) < 0) 22: EXIT_FAIL_MSG("read"); 23: 24: for (i = 0; i<cnt; i++) 25: /* iov_rlen marks bytes read */ 26: scan_pkg(iov[i].iov_base, iov[i].iov_rlen); 27: do { 28: size_twr_iovs; 29: /* write iov_rlen bytes */ 30: wr_iovs = write(fd, &iov[wcnt], cnt-wcnt); 31: if (wr_iovs< 0) 32: EXIT_FAIL_MSG("write"); 33: wcnt += wr_iovs; 34: } while (wcnt<cnt); 35: }

NetSlice Evaluation Compare against state-of-the-art – RouteBricks in-kernel, Click & pcap-mmap user-space Additional baseline scenario – All traffic through single NIC queue (receive-livelock) What is the basic forwarding performance? – How efficient is the streamlining of one NetSlice? How is NetSlice scaling with the number of cores?

Experimental Setup R710 packet processors – dual socket quad core 2.93GHz Xeon X5570 (Nehalem) – 8MB of shared L3 cache and 12GB of RAM 6GB connected to each of the two CPU sockets Two Myri-10G NICs R900 client end-hosts – four socket 2.40GHz Xeon E7330 (Penryn) – 6MB of L2 cache and 32GB of RAM

Simple Packet Routing End-to-end throughput, MTU (1500 byte) packets 1/11 of NetSlice 74% of kernel

Linear Scaling with CPUs IPsec with 128 bit key—typically used by VPN – AES encryption in Cipher-block Chaining mode

Outline Difficulties in building packet processors Netslice Evaluation Discussions Conclusion

Software Packet Processors Can support 10GE and more at line-speed – Batching Hardware, device driver, cross-domain batching – Hardware support Multi-queue, multi-core, NUMA, GPU – Removing IRQ overhead – Removing memory overhead Including zero-copy – Bypassing kernel network stack

Software Packet Processors BatchingParallelismZero-CopyPortabilityDomain Raw socketUser RouteBricksKernel PacketShaderUser PF_RINGUser netmapUser Kernel-bypassUser NetSliceUser

Software Packet Processors BatchingMulti-queueZero-CopyPortabilityDomain Raw socketUser RouteBricksKernel PacketShaderUser PF_RINGUser netmapUser Kernel-bypassUser NetSliceUser

Software Packet Processors BatchingMulti-queueZero-CopyPortabilityDomain Raw socketUser RouteBricksKernel PacketShaderUser PF_RINGUser netmapUser Kernel-bypassUser NetSliceUser

Software Packet Processors BatchingMulti-queueZero-CopyPortabilityDomain Raw socketUser RouteBricksKernel PacketShaderUser PF_RINGUser netmapUser Kernel-bypassUser NetSliceUser

BatchingParallelismZero-CopyPortabilityDomain Raw socketUser RouteBricksKernel PacketShaderUser PF_RINGUser netmapUser Kernel-bypassUser NetSliceUser Software Packet Processors

BatchingParallelismZero-CopyPortabilityDomain Raw socketUser RouteBricksKernel PacketShaderUser PF_RINGUser netmapUser Kernel-bypassUser NetSliceUser Software Packet Processors Optimized for RX path only

Software Packet Processors BatchingParallelismZero-CopyPortabilityDomain Raw socketUser RouteBricksKernel PacketShaderUser PF_RINGUser netmapUser Kernel-bypassUser NetSliceUser

Discussions 40G and beyond – DPI, FEC, DEDUP, … Deterministic RSS Small packets

Conclusion NetSlices: A new abstraction – OS support to build packet processing applications – Harness implicit parallelism of modern hardware to scale – Highly portable Webpage:

Before Next time Project Progress – Need to setup environment as soon as possible – And meet with groups, TA, and professor Lab3 – Packet filter/sniffer – Due Thursday, October 16 – Use Fractus instead of Red Cloud Required review and reading for Friday, October 15 – “NetSlices: Scalable Multi-Core Packet Processing in User-Space”, T. Marian, K. S. Lee, and H. Weatherspoon. ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), October 2012, pages – – Check piazza: Check website for updated schedule