Netslice: Enabling Critical Network Infrastructure with Commodity Routers
Prof. Hakim Weatherspoon, Cornell University
Joint work with Tudor Marian and Ki Suh Lee
TRUST Autumn 2010 Conference, Stanford University, November 10, 2010
Commodity Datacenters
- Datacenters are becoming a commodity: the unit of replacement
- Datacenter in a box: already set up with commodity hardware & software (Intel, Linux, a petabyte of storage)
- Plug in network, power & cooling and turn it on
- Typically connected via optical fiber; may form a network of such datacenters
Commodity Datacenters
[figure: "Titan tech boom", credit Randy Katz]
Network of Globally Distributed Datacenters
- Cloud computing: datacenters interconnected via fiber
- Long Fat Networks (LFNs), or λ-networks
- Packet processors and extensible routers (middleboxes) increase the functionality, performance, reliability, and security of the network
- E.g., DPI, IDS, PEPs, protocol accelerators, overlay routers, multimedia servers, security appliances, network monitors
[diagram: packet-processor middleboxes on the datacenter interconnect]
Network of Globally Distributed Datacenters
- Packet processors
  - Maelstrom [NSDI'08, TONS'10], SMFS [FAST'09]
  - FEC, TCP-Split, de-duplication, network-sync
- OS abstractions for packet processing in user-space: Netslice
- Lambda networks
  - Cornell NLR Rings [DSN'10]
  - SDNA/BiFocals [IMC'10]
[diagram: Maelstrom, Netslice, SDNA/BiFocals, TCP-Split, SMFS, and Cornell NLR Rings placed on the network map]
Challenges
- Large traffic volume processed per second (10 Gb/s)
- Typical packet processors are realized in hardware, trading off flexibility and programmability for speed
Goal: Improve datacenter communication
- Packet processors and extensible routers built from the abundance of readily available commodity servers
- Commodity hardware and software instead of proprietary specialized hardware
Takeaway
- The raw socket cannot take advantage of multicore/multiqueue hardware
- A new OS abstraction is needed to take advantage of parallelism
[chart: in-kernel forwarding at 9.7 Gb/s vs. raw socket at 2.25 Gb/s]
Outline: Packet Processing Abstractions
- The case for user-space packet processors
- What's wrong with the raw socket in a multicore environment?
  - Hardware and software overheads
  - Contention and lack of application control over resources
  - Need a new OS abstraction to take advantage of parallelism
- Netslice
- Evaluation
- Group introduction
- Conclusion
The Case Against Low-level Packet Processors
- Idiosyncrasies of the memory allocator: small virtual address spaces, inability to swap out pages, limits on contiguous memory chunks
- Execution contexts and preemptive precedence: interrupt, bottom half, task/user context
- Synchronization primitives tightly coupled with execution contexts (e.g., can I block?)
- Lack of development tools
- Lack of fault isolation: a bug in the kernel is lethal
[diagram: hardware / OS kernel network stack / user-space application]
High-level Packet Processors: Where Have All My Cycles Gone?
- Opportunity: exploit hardware parallelism
- Overheads: contention, contention, contention! The memory wall
- OS design overheads: system calls, context switches, scheduling, blocking
[diagram: CPU and memory]
Contention: Amdahl's Law
- Bounds the maximum expected parallelism speedup
- A program of total work 1 splits into a serial fraction (1 - P) and a parallel fraction P run on N CPUs
- The speedup is 1 / ((1 - P) + P/N)
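Stated as a formula (a standard rendering of Amdahl's Law; the P = 0.9 example is illustrative, not from the talk):

\[
S(N) = \frac{1}{(1-P) + P/N}, \qquad
S(8)\Big|_{P=0.9} = \frac{1}{0.1 + 0.9/8} \approx 4.7, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{1-P} = 10
\]

Even a 10% serial fraction caps the speedup at 10x, no matter how many cores are added.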
Contention: Cache-coherent Architecture
- Effects of cache coherency & memory accesses when cores read and write blocks of data concurrently
- Commodity system: Xeon X533 with 4MB L2 cache
Contention: Peripheral NICs
- Slow cores, fast network interface cards (NICs)
- More cores exhibit more contention and overhead
- Hardware transmit/receive multi-queue support (tx/rx queues)
Software Overheads in the Conventional Network Stack
- Raw socket: all traffic from all NICs funnels to user-space
- Hardware and software are loosely coupled
- Applications have no (end-to-end) control over resources
[diagram: many per-NIC tx/rx queues and network stacks feeding a single raw socket and application]
Software Overheads in the Conventional Network Stack
- The API is too general, hence the network stack is complex and bloated: raw sockets, end-point sockets, and files all go through the same machinery
- The path taken by a packet is unnecessarily expensive
- Hides information from applications
- Limited functionality: a least-common-denominator API
- Inefficient API: one system call issued per packet (sketch below)
[diagram: raw socket, TCP/UDP/AF_UNIX sockets, and file access all sharing the network stack and file API]
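For contrast with what follows, a minimal sketch of the conventional approach (illustrative code, not from the talk): a Linux AF_PACKET raw socket pays one recvfrom() kernel crossing, plus a copy, for every single packet.

    /* Conventional raw-socket receive loop: one system call per packet.
     * Requires CAP_NET_RAW (e.g., run as root). */
    #include <arpa/inet.h>      /* htons */
    #include <linux/if_ether.h> /* ETH_P_ALL */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* One raw socket sees traffic from ALL interfaces... */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        unsigned char frame[2048];
        for (;;) {
            /* ...and pays one protection-domain crossing per packet. */
            ssize_t len = recvfrom(fd, frame, sizeof(frame), 0, NULL, NULL);
            if (len < 0) { perror("recvfrom"); break; }
            /* process frame[0..len-1] */
        }
        close(fd);
        return 0;
    }

At 10 Gb/s of MTU-sized packets this is roughly 800K system calls per second before any useful work is done.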
Netslice: a User-space Packet Processor
- Give power to the application: packet processing in user-space
- Four-pronged approach (high level):
  - Prevent contention: spatially partition the hardware
  - End-to-end control: provide fine-grained control over hardware
  - Streamline the path packets take
  - Export a rich, efficient, backwards-compatible API
[diagram: Netslice sits between the kernel network stack and the user-space application]
Netslice Spatial Partitioning
- Prevents contention via independent (parallel) execution contexts
- Split each Network Interface Controller (NIC): one NIC queue per NIC per context
- Group and split the CPU cores (see the sketch after this list)
- Implicit resources (bus and memory bandwidth)
- Spatial partitioning (exclusive access) instead of temporal partitioning (time-sharing)
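A minimal sketch of the core-grouping step (illustrative only: the real partitioning is performed by the Netslice kernel module, and the pairing policy below is a guess): each slice gets an exclusive pair of CPUs, one for its kernel half and one for its user half, so no two slices time-share a core.

    #define _GNU_SOURCE
    #include <sched.h>

    #define NCPUS   16            /* e.g., 2x quad-core with hyperthreading */
    #define NSLICES (NCPUS / 2)

    struct slice_cpus {
        cpu_set_t k_peer;         /* CPU reserved for in-kernel packet work */
        cpu_set_t u_peer;         /* CPU reserved for the user-space half   */
    };

    /* Statically carve the machine: slice i owns CPUs 2i and 2i+1. */
    static void partition(struct slice_cpus slices[NSLICES])
    {
        for (int i = 0; i < NSLICES; i++) {
            CPU_ZERO(&slices[i].k_peer);
            CPU_ZERO(&slices[i].u_peer);
            CPU_SET(2 * i,     &slices[i].k_peer);
            CPU_SET(2 * i + 1, &slices[i].u_peer);
        }
    }

A fixed pairing keeps the policy simple; a real placement would account for cache and NUMA topology, which the next slide's example hardware makes explicit.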
Netslice Spatial Partitioning Example
- 2x quad-core Intel Xeon X5570 (Nehalem)
- Two simultaneous hyperthreads per core: the OS sees 16 CPUs
- Non-Uniform Memory Access (NUMA), QuickPath point-to-point interconnect, shared L3 cache
Fine-grained Hardware Control
- End-to-end control: the application controls NIC queue and CPU slice allocation
- NIC hardware interrupt routing & NIC queue
- Kernel execution context and user-space execution context
- Tight coupling of software and hardware components (interrupt-steering sketch below)
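One concrete piece of that control is interrupt steering: route a NIC queue's interrupt to the CPU that runs the slice's kernel context. A hedged sketch using the standard Linux /proc/irq/<n>/smp_affinity knob (the IRQ number comes from /proc/interrupts in practice; nothing here is Netslice-specific):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Steer one interrupt line to one CPU by writing a hex CPU bitmask
     * to /proc/irq/<irq>/smp_affinity (root required). */
    static int steer_irq(int irq, unsigned int cpu)
    {
        char path[64], mask[16];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        snprintf(mask, sizeof(mask), "%x", 1u << cpu);

        int fd = open(path, O_WRONLY);
        if (fd < 0) return -1;
        ssize_t n = write(fd, mask, strlen(mask));
        close(fd);
        return n < 0 ? -1 : 0;
    }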
Streamlined Path for Packets
- The conventional network stack is inefficient: one network stack "to rule them all"
- Performs too many memory accesses
- Pollutes the cache; incurs context switches, synchronization, system calls, and a blocking API
Netslice API
- Expresses fine-grained hardware control
- Flexible: based on ioctl
    ioctl(fd, NETSLICE_CPUMASK_GET, &mask);
    sched_setaffinity(getpid(), sizeof(cpu_set_t), &mask.u_peer);
- Backwards compatible (read/write)
    fd = open("/dev/netslice-1", O_RDWR);
    read(fd, iov, IOVS);
- Efficient: batched send/receive (read/write) amortizes the overhead of protection-domain crossings
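Put together, a minimal sketch of the API in use (hedged reconstruction: the open/ioctl/sched_setaffinity/read/write calls and the u_peer field are from the slide; netslice.h, the IOVS batch size, and the read()-returns-packet-count behavior are assumptions):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/uio.h>
    #include <unistd.h>
    #include "netslice.h"   /* hypothetical header: assumed to define
                               NETSLICE_CPUMASK_GET and struct
                               netslice_cpu_mask with a cpu_set_t u_peer */

    #define IOVS 32         /* batch size (illustrative) */

    int main(void)
    {
        struct netslice_cpu_mask mask;
        struct iovec iov[IOVS];

        int fd = open("/dev/netslice-1", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Ask which CPUs this slice owns, then pin this process to the
         * slice's user-space peer CPU. */
        ioctl(fd, NETSLICE_CPUMASK_GET, &mask);
        sched_setaffinity(getpid(), sizeof(cpu_set_t), &mask.u_peer);

        for (;;) {
            /* Batched receive: per the slide, read() takes an iovec array
             * and a slot count, amortizing the protection-domain crossing. */
            ssize_t n = read(fd, iov, IOVS);
            if (n <= 0) break;
            /* ... inspect/rewrite iov[0..n-1] in place ... */
            write(fd, iov, n);   /* batched send of the processed packets */
        }
        close(fd);
        return 0;
    }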
Experimental Setup
- R710 packet processors: dual-socket quad-core 2.93GHz Xeon X5570 (Nehalem), 8MB of shared L3 cache, 12GB of RAM (6GB attached to each of the two CPU sockets), two Myri-10G NICs
- R900 client end-hosts: four-socket 2.40GHz Xeon E7330 (Penryn), 6MB of L2 cache, 32GB of RAM
Netslice Evaluation
- Compared against the state of the art: RouteBricks in-kernel; Click & pcap-mmap in user-space
- Additional baseline scenario: all traffic through a single NIC queue (receive livelock)
Questions:
- What is the basic forwarding performance?
- How efficient is the streamlining of one Netslice?
- What is the benefit of batching?
- How does Netslice scale with the number of cores?
- Can it build high-speed, complex packet processors?
Simple Packet Routing
- End-to-end throughput with MTU (1500-byte) packets
- Error bars (always present) denote the standard error of the mean
[chart: throughputs of 9.7G, 7.6G, 7.5G, 5.6G, and 2.25G; annotations: "74% of kernel", "1/11 of Netslice"]
Single Netslice: User and Kernel Context CPU Placement
- There are several placement choices
[diagram: alternative CPU placements for the user and kernel contexts of one Netslice]
Linear Scaling with CPUs
- IPsec with a 128-bit key, as typically used by VPNs: AES encryption in Cipher-Block Chaining (CBC) mode (workload sketch below)
[chart: throughput vs. number of CPUs used; annotations: 9.1G, 8G]
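For a sense of the per-packet work in this benchmark, a sketch of AES-128-CBC encryption using OpenSSL's EVP API (an illustrative stand-in; the talk does not say which crypto implementation the benchmark used):

    #include <openssl/evp.h>

    /* Encrypt one packet payload with AES-128 in CBC mode.
     * out must have room for len + 16 bytes of CBC padding.
     * Returns the ciphertext length, or -1 on error. */
    int encrypt_packet(const unsigned char key[16], const unsigned char iv[16],
                       const unsigned char *pkt, int len, unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int n = 0, fin = 0;

        if (!ctx) return -1;
        if (EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv) != 1 ||
            EVP_EncryptUpdate(ctx, out, &n, pkt, len) != 1 ||
            EVP_EncryptFinal_ex(ctx, out + n, &fin) != 1) {
            EVP_CIPHER_CTX_free(ctx);
            return -1;
        }
        EVP_CIPHER_CTX_free(ctx);
        return n + fin;
    }

CBC chains blocks within a packet, so encryption cannot be parallelized inside one packet; the linear scaling on the slide comes from processing different packets on different cores.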
Netslice Implementation of Maelstrom
- In-kernel reference version: 8432 lines of C
- Netslice version: 1197 lines of user-space C
- Forwarding throughput: ±37.25 Mbps (standard error)
- Maelstrom/Netslice goodput: ±35.7 Mbps (standard error)
- 27.27% FEC overhead (for r = 8, c = 3): goodput × (1 + c/(r+c)) = 8901 Mbps on the wire
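The overhead arithmetic, spelled out (r data packets carry c repair packets):

\[
\text{overhead} = \frac{c}{r+c} = \frac{3}{8+3} = \frac{3}{11} \approx 27.27\%,
\qquad
\text{wire rate} = \text{goodput} \times \Bigl(1 + \frac{c}{r+c}\Bigr)
               = \text{goodput} \times \frac{14}{11}
\]

so the 8901 Mbps wire rate on the slide corresponds to a goodput of about 8901 × 11/14 ≈ 6994 Mbps.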
10 Gb/s and Beyond
- Netslice: flexible API and spatial partitioning
- Nehalem CPUs: the front-side bus (FSB) is no longer the bottleneck
- Multiqueue NICs: each core carves a private slice of every NIC
- Batching: user-space multi-read / multi-write instead of the ossified conventional read / write
- Traditional tricks: pin down memory, minimize LLC contention, etc.
Conclusion
- The network layer is fundamental to datacenter operations; packet processors enhance network functionality and performance
- Improve network performance with software packet processors running in user-space on commodity servers
- OS support for building packet-processing applications
- Harness the implicit parallelism of modern hardware to scale
- The solution is completely portable: a kernel module loaded at runtime
Paper Trail
Theme: "Datacenter Middleboxes"
- BiFocals/SDNA in IMC 2010
- NLR study in DSN 2010
- SMFS in FAST 2009
- Maelstrom (FEC) in TONS 2010 and NSDI 2008
- FWP, NSDI 2008 poster session
More at
Questions?