
Eric Keller Oral General Exam 5/5/08 Multi-Level Architecture for Data Plane Virtualization

2 The Internet (and IP) Usage of the Internet is continuously evolving, but the way packets are forwarded (IP) hasn't changed –Meant for communication between machines –Address tied to a fixed location –Hierarchical addressing –Best-effort delivery –Addresses easy to spoof Great innovation at the edge (Skype/VoIP, BitTorrent) –Programmability of hosts at the application layer –Can't add any functionality into the network

3 Proposed Modifications Many proposals to modify some aspect of IP –No single one is best –Difficult to deploy Publish/Subscribe mechanism for objects –Instead of routing on machine address, route on object ID –e.g. DONA (Data oriented network architecture), scalable simulation Route through intermediary points –Instead of communication between machines –e.g. i3 (internet indirection infrastructure), DOA (delegation oriented architecture) Flat Addressing to separate location from ID –Instead of hierarchical based on location –e.g. ROFL (routing on flat labels), SEIZE (scalable and efficient, zero-configuration enterprise)

4 Challenges Want to Innovate in the Network –Can’t because networks are closed Need to lower barrier for who innovates –Allow individuals to create a network and define its functionality Virtualization as a possible solution –For both network of future and overlay networks –Programmable and sharable –Examples: PlanetLab, VINI

5 Network Virtualization Running multiple virtual networks at the same time over a shared physical infrastructure –Each virtual network composed of virtual routers having custom functionality Physical machine Virtual router Virtual network – e.g. blue virtual routers plus Blue links

6 Virtual Network Tradeoffs Performance Programmability Isolation Goal: Enable custom data planes per virtual network –Challenge: How to create the shared network nodes

7 Virtual Network Tradeoffs Performance Programmability Isolation Goal: Enable custom data planes per virtual network –Challenge: How to create the shared network nodes How easy is it to add new functionality? What is the range of new functionality that can be added? Does it extend beyond “software routers”?

8 Virtual Network Tradeoffs Performance Programmability Isolation Goal: Enable custom data planes per virtual network –Challenge: How to create the shared network nodes Does resource usage by one virtual network affect others? Faults? How secure is it given a shared substrate?

9 Virtual Network Tradeoffs Performance Programmability Isolation Goal: Enable custom data planes per virtual network –Challenge: How to create the shared network nodes How much overhead is there for sharing? What is the forwarding rate? Throughput? Latency?

10 Virtual Network Tradeoffs Network Containers –Duplicate stack or data structures –e.g. Trellis, OpenVZ, Logical Router Extensible Routers –Assemble custom routers from common functions –e.g. Click, Router Plug Ins, Scout Virtual Machines+Click –Run operating system on top of another operating system –e.g. Xen, PL-VINI (Linux-VServer) [Diagram: each approach annotated with the programmability/isolation/performance tradeoffs it makes]

11 Outline Architecture Implementation –Virtualizing Kernel –Challenges with kernel execution –Extending beyond commodity hardware Evaluation Conclusion/Future Work

12 Outline Architecture Implementation –Virtualizing Kernel –Challenges with kernel execution –Extending beyond commodity hardware Evaluation Conclusion/Future Work

13 Custom functionality –Custom user environment on each node (for controlling the virtual router) –Specify a single node's packet handling as a graph of common functions Isolated from others sharing the same node –Allocated share of resources (e.g. CPU, memory, bandwidth) –Protected from faults in others (e.g. another virtual router crashing) Highest performance possible [Slide figure, "User Experience (Creating a virtual network)": a user control environment with a config/query interface sits above the node's packet-handling graph of elements A1–A5 between its input and output devices; example functions include determining shortest paths, populating routing tables, checking headers, and destination lookup]

14 Lightweight Virtualization Combine graphs into a single graph –Provides lightweight virtualization Add extra packet processing (e.g. mux/demux) –Needed to direct packets to the correct graph (see the sketch below) Add resource accounting [Diagram: Graph 1 and Graph 2 are combined into a master graph that sits between the node's input and output ports]
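A stand-alone C++ sketch of that demux step: steer each packet to the per-network subgraph it belongs to. Keying on a VLAN ID and all the type names are assumptions for illustration, not the actual implementation.

```cpp
// Hypothetical sketch: a demux step that steers each incoming packet to the
// per-network subgraph it belongs to, keyed here on VLAN ID. Packet and
// Subgraph are stand-ins, not Click's actual classes.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Packet { uint16_t vlan_id; std::vector<uint8_t> data; };

struct Subgraph {                      // entry point of one virtual network's graph
    virtual void push(Packet &p) = 0;
    virtual ~Subgraph() = default;
};

class NetworkDemux {
public:
    void add_network(uint16_t vlan, Subgraph *entry) { table_[vlan] = entry; }

    // Called once per packet by the master graph's input port.
    void push(Packet &p) {
        auto it = table_.find(p.vlan_id);
        if (it != table_.end())
            it->second->push(p);       // hand off to that network's graph
        // else: drop packets that belong to no configured virtual network
    }
private:
    std::unordered_map<uint16_t, Subgraph *> table_;
};
```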

15 Increasing Performance and Isolation Partition into multiple graphs across multiple targets –Each target with different capabilities  Performance, Programmability, Isolation –Add connectivity between targets –Unified run-time interface (it appears as a single graph)  To query and configure the forwarding capabilities [Diagram: Graph 1 and Graph 2 are combined into a master graph, which is then partitioned into per-target graphs for Target0, Target1, and Target2]

16 Examples of Multi-Level Fast Path/Slow Path –IPv4: forwarding in fast path, exceptions in slow path –i3: Chord ring lookup function in fast path, handling requests in slow path Preprocessing –IPSec – do encryption/decryption in HW, rest in SW Offloading –TCP Offload –TCP Splicing Pipeline of coarse grain services –e.g. transcoding, firewall –SoftRouter from Bell Labs

17 Outline Architecture Implementation –Virtualizing Kernel –Challenges with kernel execution –Extending beyond commodity hardware Evaluation Conclusion/Future Work

18 Implementation Each network has custom functionality –Specified as graph of common functions –Click modular router Each network allocated share of resources –e.g. CPU –Linux-VServer – single resource accounting for both control and packet processing Each network protected from faults in others –Library of elements considered safe –Container for unsafe elements Highest performance possible –FPGA for modules with HW option, Kernel for modules without

19 Click Background: Overview Software architecture for building flexible and configurable routers –Widely used – commercially and in research –Easy to use, flexible, high performance (missing sharable) Routers assembled from packet-processing modules (Elements) –Simple and complex Processing is a directed graph Includes a scheduler –Schedules tasks (a series of elements) Example pipeline: FromDevice(eth0) -> Counter -> Discard
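The pipeline above, modeled as stand-alone C++ to show the element-graph idea; this is not Click's actual API (a real element skeleton appears with slide 28), just the concept of small elements wired into a directed graph.

```cpp
// Minimal stand-alone illustration of the element-graph idea: each element
// does one simple job and pushes packets to its downstream neighbor, so a
// router is assembled by wiring elements together.
#include <cstdio>

struct Pkt { int len; };

struct Element {
    Element *next = nullptr;
    virtual void push(Pkt p) { if (next) next->push(p); }
    virtual ~Element() = default;
};

struct Counter : Element {
    long count = 0;
    void push(Pkt p) override { ++count; Element::push(p); }
};

struct Discard : Element {
    void push(Pkt) override { /* sink: packet is dropped here */ }
};

int main() {
    Counter counter; Discard discard;
    counter.next = &discard;               // FromDevice(eth0) -> Counter -> Discard
    for (int i = 0; i < 5; ++i) counter.push(Pkt{64});
    std::printf("packets seen: %ld\n", counter.count);
}
```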

20 Linux-VServer

21 Linux-VServer + Click + NetFPGA [Diagram: a coordinating process with an installer manages Click instances (kernel and user space) and Click on the NetFPGA]

22 Outline Architecture Implementation –Virtualizing Click in the Kernel –Challenges with kernel execution –Extending beyond software routers Evaluation Conclusion/Future Work

23 Virtual Kernel Mode Click Want to run in Kernel mode –Close to 10x higher performance than user mode Use library of ‘safe’ elements –Since Kernel is shared execution space Need resource accounting –Click scheduler does not do resource accounting –Want resource accounting system-wide (i.e. not just inside of packet processing)

24 Resource Accounting with VServer Purpose of Resource Accounting –Provides isolation between virtual networks Unified resource accounting –For packet processing and control VServer’s Token Bucket Extension to Linux Scheduler –Controls eligibility of processes/threads to run Integrating with Click –Each individual Click configuration assigned to its own thread –Each thread associated with VServer context  Basic mechanism is to manipulate the task_struct

25 Outline Architecture Implementation –Virtualizing Kernel –Challenges with kernel execution –Extending beyond software routers Evaluation Conclusion/Future Work

26 Unyielding Threads Linux kernel threads are cooperative (i.e. must yield) –Token scheduler controls when a thread is eligible to start A single long task can cause short-term disruptions –Affecting delay and jitter on other virtual networks Token bucket does not go negative –Long term, a virtual network can get more than its share Token bucket parameters (see the sketch below): tokens added at rate A; minimum tokens to execute M; tokens consumed at 1 per scheduler tick; bucket size S
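A toy model of the token bucket drawn on this slide, using its labels (rate A, minimum M, bucket size S, one token consumed per tick). It illustrates the mechanism only; it is not VServer's scheduler code, and the 25% share is an example value.

```cpp
// Token-bucket eligibility as sketched on the slide: a context wakes up once
// it has accumulated M tokens, consumes one token per tick while it runs, and
// stops when the bucket empties. Note the bucket is clamped at zero, matching
// the slide's point that it never goes negative.
#include <algorithm>
#include <cstdio>

struct TokenBucket {
    double tokens;   // current fill level
    double A;        // tokens added per scheduler tick (CPU share)
    double M;        // minimum tokens needed to (re)start running
    double S;        // bucket capacity
    bool   running;  // is this context currently allowed on the CPU?

    void tick() {
        tokens = std::min(tokens + A, S);              // refill, capped at bucket size
        if (!running && tokens >= M) running = true;   // enough credit: become eligible
        if (running) {
            tokens -= 1.0;                             // consume one token per tick on the CPU
            if (tokens <= 0) { tokens = 0; running = false; }   // never goes negative
        }
    }
};

int main() {
    TokenBucket tb{0.0, 0.25, 4.0, 16.0, false};       // roughly a 25% CPU share
    int ran = 0;
    for (int t = 0; t < 200; ++t) {
        tb.tick();
        if (tb.running) ++ran;
    }
    std::printf("ran %d of 200 ticks (~%d%% share)\n", ran, ran / 2);
}
```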

27 Unyielding Threads (solution) Determine the maximum allowable execution time –e.g. from token bucket parameters, network guarantees Determine the pipeline's execution time –Elements from the library have known execution times –Custom elements' execution times are unknown Break the pipeline up (for known costs); execute inside the container (for unknown costs), as sketched below [Diagram: a pipeline elem1 -> elem2 -> elem3 is either split into multiple kernel tasks or routed from the kernel to user space and back around the unknown element]
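A small sketch of that splitting decision, reusing the per-element cycle counts reported later (slide 40); the budget value and element names are illustrative, whereas the real system would derive the budget from the token-bucket parameters and network guarantees.

```cpp
// Walk a pipeline of elements, grouping consecutive elements whose known costs
// fit within the per-task budget, and sending elements with unknown cost to
// the user-space container.
#include <cstdio>
#include <string>
#include <vector>

struct Elem { std::string name; long cycles; bool cost_known; };

int main() {
    const long budget = 6000;                       // max cycles per kernel task (example)
    std::vector<Elem> pipeline = {
        {"CheckLength", 400, true}, {"RadixIPLookup", 1000, true},
        {"CustomElem", 0, false},   {"Counter", 700, true},
    };

    long used = 0; int segment = 0;
    for (const Elem &e : pipeline) {
        if (!e.cost_known) {                        // unknown cost -> user-space container
            std::printf("%s -> user-space container\n", e.name.c_str());
            used = 0; ++segment;                    // kernel pipeline resumes afterwards
            continue;
        }
        if (used + e.cycles > budget) {             // would exceed budget -> start a new task
            ++segment; used = 0;
        }
        used += e.cycles;
        std::printf("%s -> kernel task %d\n", e.name.c_str(), segment);
    }
}
```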

28 Custom Elements Written in C++ Elements have access to global state –Kernel state/functions –Click global state Could… –Pre-compile in user mode –Pre-compile with restricted header files Not perfect: –With C++, you can manipulate pointers Instead, custom elements are unknown (“unsafe”) –Execute in container in user space
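For reference, a skeleton of what a custom element looks like against Click's element API (abridged; exact headers and macros can differ between Click versions). The comments mark why arbitrary C++ makes such elements "unsafe" unless they come from the vetted library.

```cpp
// A minimal custom Click element. Because element code runs in the shared
// execution space with access to kernel and Click global state, and C++ lets
// it manipulate arbitrary pointers, an element like this is treated as
// "unsafe" and run in the user-space container unless it is from the library.
#include <click/config.h>
#include <click/element.hh>
CLICK_DECLS

class MyCounter : public Element {
public:
    MyCounter() : _count(0) {}

    const char *class_name() const { return "MyCounter"; }
    const char *port_count() const { return PORTS_1_1; }   // one input, one output

    Packet *simple_action(Packet *p) {
        _count++;          // nothing stops an element from dereferencing arbitrary
        return p;          // pointers here, hence the safety concern
    }
private:
    unsigned _count;
};

CLICK_ENDDECLS
EXPORT_ELEMENT(MyCounter)
```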

29 Outline Architecture Implementation –Virtualizing Kernel –Challenges with kernel execution –Extending beyond commodity hardware Evaluation Conclusion/Future Work

30 Extending beyond commodity HW PC + Programmable NIC (e.g. NetFPGA) –FPGA on PCI card –4 GigE ports –On board SRAM and DRAM Jon Turner’s “Pool of Processing Elements” – with crossbar –PEs can be GPP, NPU, FPGA –Switch Fabric = Crossbar Switch Fabric LC 1 PE 1 PE 2 LC 2 PE m LC n... Line Cards Processing Engines Partition between FPGA and Software Generalize: Partition among PEs

31 FPGA Click Two previous approaches –Cliff – Click graph to Verilog, standard interface on modules –CUSP – optimize the Click graph by parallelizing internal statements Our approach: –Build on Cliff by integrating FPGAs into Click (the tool) Software analogies –Connection to outside environment –Packet transfer –Element specification and implementation –Run-time querying and configuration –Memory –Notifiers –Annotations Example pipeline: FromDevice(eth0) -> Element(LEN 5) -> ToDevice(eth0)

32 Outline Architecture Implementation –Virtualizing Kernel –Challenges with kernel execution –Extending beyond commodity hardware Evaluation Conclusion/Future Work

33 Experimental Evaluation Is multi-level the right approach? –i.e. is it worth the effort to support kernel and FPGA –Does programmability imply less performance? What is the overhead of virtualization? –From the container: when you need to go to user space –From using multiple threads: when running in the kernel Are the virtual networks isolated in terms of resource usage? –What is the maximum short-term disruption from unyielding threads? –How long can a task run without leading to long-term unfairness?

34 Setup PC3000 nodes on Emulab (3 GHz, 2 GB RAM) [Testbed diagram: nodes n0–n3 connected through rtr, the router under test (Linux or a Click config); the generator sends packets from n0 to n1 tagged with the current time, the router modifies the IP and Ethernet headers to be from n1 to n2, and the receiver diffs the arrival time against the packet's timestamp, keeping a running average in memory; see the sketch below]
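A minimal sketch of the timestamp-and-diff measurement described above; the payload layout, clock choice, and running-average bookkeeping are assumptions for illustration, not the actual generator code.

```cpp
// The generator stamps each packet with the send time; the receiver subtracts
// that stamp from the arrival time and keeps a running average in memory.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>

using Clock = std::chrono::steady_clock;

static int64_t now_ns() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        Clock::now().time_since_epoch()).count();
}

void tag_packet(uint8_t *payload) {                       // at the generator
    int64_t t = now_ns();
    std::memcpy(payload, &t, sizeof(t));
}

double record_latency(const uint8_t *payload, double &avg_ns, long &n) {  // at the receiver
    int64_t sent;
    std::memcpy(&sent, payload, sizeof(sent));
    double latency = double(now_ns() - sent);
    avg_ns += (latency - avg_ns) / ++n;                   // running average kept in memory
    return latency;
}

int main() {
    uint8_t payload[64] = {};
    double avg = 0; long n = 0;
    tag_packet(payload);
    record_latency(payload, avg, n);
    std::printf("avg latency: %.0f ns over %ld packets\n", avg, n);
}
```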

35 Is multi-level the right approach? Performance benefit going from user to kernel, and again from kernel to FPGA Does programmability imply less performance? –Performance is not sacrificed by introducing programmability

36 What is the overhead of virtualization? From container When you must go to user space, what is the cost of executing in a container? Overhead of executing in a VServer is minimal

37 What is the overhead of virtualization? (from using multiple threads) Put the same Click graph in each thread and round-robin traffic between them [Diagram: PollDevice feeds a RoundRobin element that distributes packets across several threads (each runs X tasks per yield), each running a 4-port router compound element, into ToDevice]

38 How long to run before yielding # tasks per yield: –Low => high context switching, I/O executes often –High => low context switching, I/O executes infrequently

39 What is the overhead of virtualization? From using multiple threads Given sweet spot for each # of virtual networks –Increasing number of virtual networks from 1 to 10 does not hurt aggregate performance significantly Alternatives to consider –Single threaded with VServer –Single threaded, modify Click to do resource accounting –Integrate polling into threads

40 What is the maximum short-term disruption from unyielding threads? Profile of (some) elements –Standard N-port router example: ~5400 cycles (1.8 us) –RadixIPLookup (167k entries): ~1000 cycles –Simple elements: CheckLength ~400 cycles, Counter ~700 cycles, HashSwitch ~450 cycles Maximum disruption is the length of the longest task –Possible to break up pipelines [Measurement pipeline: InfiniteSource -> Elem -> Discard (NoFree), timed with RoundTripCycleCount]

41 How long can a task run without leading to long-term unfairness? [Diagram: two pipelines of InfiniteSource -> 4-port router (compound element) -> Discard; one counts cycles, the other is limited to 15% of the CPU]

42 How long can a task run without leading to long-term unfairness? Tasks longer than 1 token can lead to unfairness Run long-executing elements in user space –The performance overhead of user space is not as big of an issue there [Chart annotation: zoomed in, ~10k extra cycles per task]

43 Outline Architecture Implementation –Virtualizing Kernel –Challenges with kernel execution –Extending beyond commodity hardware Evaluation Conclusion/Future Work

44 Conclusion Goal: Enable custom data planes per virtual network Tradeoffs –Performance –Isolation –Programmability Built a multi-level version of Click –FPGA –Kernel –Container

45 Future Work Scheduler –Investigate alternatives to improve efficiency Safety –Process to certify element as safe (can it be automated?) Applications –Deploy on VINI testbed –Virtual router migration HW/SW Codesign Problem –Partition decision making –Specification of elements (G language)

46 Questions

47 Backup

48 Signs of Openness There are signs that network owners and equipment providers are opening up Peer-to-peer and network provider collaboration –Allowing intelligent selection of peers –e.g. Pando/Verizon (P4P), BitTorrent/Comcast Router Vendor API –allowing creation of software to run on routers –e.g. Juniper PSDP, Cisco AXP Cheap and easy access to compute power –Define functionality and communication between machines –e.g. Amazon EC2, Sun Grid

49 Example 1: User/Kernel Partition Execute "unsafe" elements in a container –Add communication elements [Diagram: safe elements s1, s2, s3 run in the kernel pipeline; unsafe element u1 runs in a user-space container; ToUser (tu) and FromUser (fu) on the kernel side connect to FromKernel (fk) and ToKernel (tk) around u1 in user space]

50 Example 2: Non-Commodity HW PC + programmable NIC (e.g. NetFPGA) –FPGA on a PCI card –4 GigE ports –On-board SRAM and DRAM Jon Turner's "Pool of Processing Elements" – with crossbar –PEs can be GPP, NPU, FPGA –Switch fabric = crossbar [Diagram: line cards LC1..LCn and processing engines PE1..PEm connected through a switch fabric] Partition between FPGA and software; generalize: partition among PEs

51 Example 2: Non-Commodity HW Redrawing the picture for FPGA/SW… –Elements can have a HW implementation, a SW implementation, or both (choose one) [Diagram: hardware elements hw1, hw2, hw3 run on the FPGA and software element sw1 runs on the CPU; ToCPU (tc) and FromCPU (fc) on the FPGA connect to FromDevice (fd) and ToDevice (td) around sw1 in software]

52 Connection to outside environment In Linux, the "Board" is a set of devices (e.g. eth0) –Can query Linux for what's available –Network driver (to read/write packets) –Inter-process communication (for communication with handlers) The FPGA is a chip on a board –Using "eth0" needs pins to connect to and some on-chip logic (in the form of an IP core) Board API (sketched below) –Specify available devices –Specify the size of the address block (used by the char driver) –Provide an elaborate() function that generates a top-level Verilog module and a UCF file (pin assignments)
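A hedged sketch of what such a Board API might look like; the slide only names the three responsibilities and the elaborate() function, so every class, method, and device name below is invented for illustration and the real interface may differ.

```cpp
// Hypothetical Board abstraction: report the devices the board offers, the
// size of the address block the char driver will map, and an elaborate()
// step that writes a top-level Verilog module plus a UCF pin-assignment file.
#include <cstddef>
#include <ostream>
#include <string>
#include <vector>

class Board {
public:
    virtual std::vector<std::string> devices() const = 0;        // e.g. {"eth0", "eth1", ...}
    virtual size_t address_block_size() const = 0;               // bytes exposed over PCI
    virtual void elaborate(std::ostream &verilog_top,            // top-level module wiring
                           std::ostream &ucf) const = 0;         // pin assignments
    virtual ~Board() = default;
};

class NetFPGABoard : public Board {
public:
    std::vector<std::string> devices() const override {
        return {"eth0", "eth1", "eth2", "eth3"};                 // the 4 GigE ports
    }
    size_t address_block_size() const override { return 0x10000; }
    void elaborate(std::ostream &verilog_top, std::ostream &ucf) const override {
        verilog_top << "module top(/* board pins */);\n"
                       "  // instantiate MAC cores and the element pipeline here\n"
                       "endmodule\n";
        ucf << "NET \"eth0_txd<0>\" LOC = \"placeholder_pin\";\n"; // placeholder constraint
    }
};
```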

53 Packet Transfer In software it is a function call In the FPGA, use a pipeline of elements with a standard interface Option 1: stream the packet through, one word at a time –Could just be the header –Push/pull a bit tricky Option 2: pass a pointer –But would have to go to memory (inefficient) [Diagram: Element1 and Element2 connected by a standard data/ctrl/valid/ready interface]

54 Element specification and implementation Need –Meta-data –Specify packet processing –Specify run-time query handling (next slide) Meta-data –Use Click C++ API –Ports –Registers to use specific devices  e.g. FromDevice(eth0) registers to use eth0 Packet processing –Use C++ to print out Verilog  Specialized based on instantiation parameters (config string) –Standard interface for packets –Standard interface for handlers  Currently a memory-mapped register
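To make "use C++ to print out Verilog" concrete, here is a hypothetical generator for a length-check element specialized by its config string (the LEN 5 parameter from the earlier pipeline); the module and port names are invented, not the interface the project actually generates.

```cpp
// The element's generator method emits a Verilog module whose behavior is
// specialized by the configuration string it was instantiated with: here the
// length threshold becomes a constant in the generated hardware.
#include <ostream>
#include <string>

class CheckLengthHW {
public:
    explicit CheckLengthHW(unsigned max_len) : max_len_(max_len) {}

    void emit_verilog(std::ostream &out, const std::string &inst) const {
        out << "module " << inst << " (\n"
            << "  input  [63:0] in_data,  input  in_valid,  output in_ready,\n"
            << "  output [63:0] out_data, output out_valid, input  out_ready,\n"
            << "  input  [15:0] pkt_len\n"
            << ");\n"
            // Specialization: the threshold from the config string is baked in.
            << "  wire pass = (pkt_len <= 16'd" << max_len_ << ");\n"
            << "  assign out_data  = in_data;\n"
            << "  assign out_valid = in_valid & pass;\n"
            << "  assign in_ready  = out_ready;\n"
            << "endmodule\n";
    }
private:
    unsigned max_len_;
};
```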

55 Run-time querying and configuration Query state and update configuration in elements –e.g. "add ADDR/MASK [GW] OUT" When creating an element –Request an address block –Specify software handlers –Use read/write methods to get data Allocating addresses (see the sketch below) –Given the total size and the size of each element's requested block Generating decode logic [Diagram: a telnet session talks to Click in user space, through the char driver in the kernel, across the PCI bus to decode logic on the FPGA that routes accesses to elem1, elem2, elem3]
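A rough illustration of the address-allocation step; the total size and per-element requested sizes are the only inputs the slide names, so the alignment policy and example values below are assumptions rather than the actual allocator.

```cpp
// Each element asks for a block of handler registers; bases are handed out
// sequentially (aligned) within the board's total address block.
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

struct Request { std::string element; size_t size; };

std::vector<size_t> allocate(const std::vector<Request> &reqs, size_t total, size_t align = 4) {
    std::vector<size_t> base;
    size_t next = 0;
    for (const Request &r : reqs) {
        size_t sz = (r.size + align - 1) & ~(align - 1);   // round up to alignment
        if (next + sz > total) throw std::runtime_error("address block exhausted");
        base.push_back(next);
        next += sz;
    }
    return base;
}

int main() {
    std::vector<Request> reqs = {{"elem1", 16}, {"elem2", 64}, {"elem3", 8}};
    auto bases = allocate(reqs, 0x10000);                  // total block size (example)
    for (size_t i = 0; i < reqs.size(); ++i)
        std::printf("%s: base 0x%zx, size %zu\n", reqs[i].element.c_str(), bases[i], reqs[i].size);
}
```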

56 Memory In software –malloc –static arrays –Share table through global variables or passing pointer –Elements that do no packet processing (passed as configuration string to elements) In FPGA –Elements have local memory (registers/BRAM) –Unshared (off-chip) memories – treat like a device –Shared (off-chip) global memories (Unimplemented)  Globally shared vs. Shared between subset of elements –Elements that do no packet processing (Unimplemented)

57 Notifiers, Annotations Notifiers –Element registers as listener or notifier –In FPGA, create extra signal(s) from notifier to listener Annotations –Extra space in Packet data structure –Used to mark packet with info not in packet  Which input port packet arrived in  Result of lookup –In software  fixed byte array –In FPGA  packet is streamed through, so adding extra bytes is simple

58 User/Kernel Communication Add communication elements (see the sketch below) –Use mknod for each direction –ToUser/FromUser store packets and provide file functions –ToKernel/FromKernel use file I/O [Diagram: same kernel/user partition as Example 1 – safe elements s1–s3 with ToUser (tu) and FromUser (fu) in the kernel; FromKernel (fk), u1, and ToKernel (tk) in the user-space container]
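A simplified user-space stand-in for the container side of this hand-off, using plain POSIX file I/O on the two character devices created with mknod; the /dev paths, one-packet-per-read assumption, and buffer size are placeholders, and the real FromKernel/ToKernel elements live inside the user-level Click rather than a bare loop like this.

```cpp
// Packets arrive from the kernel pipeline through one character device and
// are written back through another after the "unsafe" element processes them.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    int from_kernel = open("/dev/click_to_user", O_RDONLY);    // fed by ToUser (tu) in the kernel
    int to_kernel   = open("/dev/click_from_user", O_WRONLY);  // drained by FromUser (fu)
    if (from_kernel < 0 || to_kernel < 0) { perror("open"); return 1; }

    std::vector<unsigned char> buf(2048);
    for (;;) {
        ssize_t n = read(from_kernel, buf.data(), buf.size()); // one packet per read (assumed)
        if (n <= 0) break;
        // ... run the unsafe element u1 over buf[0..n) here ...
        if (write(to_kernel, buf.data(), n) != n) break;       // return packet to the kernel graph
    }
    close(from_kernel);
    close(to_kernel);
}
```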

59 FPGA/Software Communication Add communication elements –ToCPU/FromCPU use a device that communicates with Linux over the PCI bus –Network driver in Linux –ToDevice/FromDevice – standard Click elements [Diagram: same FPGA/software partition as Example 2 – hardware elements hw1–hw3 with ToCPU (tc) and FromCPU (fc) on the FPGA; FromDevice (fd), sw1, and ToDevice (td) in software]