TCP Servers: Offloading TCP/IP Processing in Internet Servers

TCP Servers: Offloading TCP/IP Processing in Internet Servers Liviu Iftode Department of Computer Science University of Maryland and Rutgers University

My Research: Network-Centric Systems
- TCP Servers and Split-OS [NSF CAREER]
- Migratory TCP and Service Continuations
- Federated File Systems
- Smart Messages [NSF ITR-2] and Spatial Programming for Networks of Embedded Systems
http://discolab.rutgers.edu

Networking and Performance
[Figure: Internet servers (S) reached over an IP network / WAN using TCP; storage devices (D) reached over storage networks / SAN. IP or not IP? TCP or not TCP?]
The transport-layer protocol must be efficient

The Scalability Problem
Apache web server on 1-way and 2-way 300 MHz Intel Pentium II SMP, repeatedly accessing a static 16 KB file

Breakdown of CPU Time for Apache

The TCP/IP Stack
APPLICATION
SYSTEM CALLS
KERNEL
- SEND: copy_from_application_buffers, TCP_send, IP_send, packet_scheduler, setup_DMA, packet_out
- RECEIVE: copy_to_application_buffers, TCP_receive, IP_receive, software_interrupt_handler, hardware_interrupt_handler, packet_in

Breakdown of CPU Time for Apache

Serialized Networking Actions
APPLICATION
SYSTEM CALLS
Serialized Operations
- SEND: copy_from_application_buffers, TCP_send, IP_send, packet_scheduler, setup_DMA, packet_out
- RECEIVE: copy_to_application_buffers, TCP_receive, IP_receive, software_interrupt_handler, hardware_interrupt_handler, packet_in

TCP/IP Processing is Very Expensive
- Protocol processing can take up to 70% of the CPU cycles
  - For Apache web server on uniprocessors [Hu 97]
  - Can lead to Receive Livelock [Mogul 95]
- Interrupt handling consumes a significant amount of time
  - Soft Timers [Aron 99]
- Serialization affects scalability

Outline
- Motivation
- TCP Offloading using TCP Server
- TCP Server for SMP Servers
- TCP Server for Cluster-based Servers
- Prototype Evaluation

TCP Offloading Approach
- Offload network processing from application hosts to dedicated processors/nodes/I-NICs
- Reduce OS intrusion
  - network interrupt handling
  - context switches
  - serializations in the networking stack
  - cache and TLB pollution
- Should adapt to changing load conditions
- Software or hardware solution?

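The TCP Server Idea
[Figure: in the conventional design, the SERVER runs the application, the OS and TCP/IP and talks to the CLIENT directly. In the TCP server design, the host processor runs the application and OS, the TCP server runs TCP/IP and talks to the CLIENT, and host and TCP server are linked by FAST COMMUNICATION.]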

TCP Server Performance Factors
- Efficiency of the TCP server implementation
  - event-based server, no interrupts
- Efficiency of communication between host(s) and TCP server
  - non-intrusive, low-overhead
- API
  - asynchronous, zero-copy
- Adaptiveness to load

TCP Servers for Multiprocessor Systems
[Figure components: multiprocessor (SMP) server with CPU 0 to CPU N, Application, Host OS, TCP Server, SHARED MEMORY, CLIENT]

TCP Servers for Clusters with Memory-to-Memory Interconnects
[Figure components: cluster-based server with Host (Application), TCP Server, MEMORY-to-MEMORY INTERCONNECT, CLIENT]

TCP Servers for Multiprocessor Servers

SMP-based Implementation
[Figure components: Application, Host OS, TCP Server, IO APIC; interrupt classes shown: network and clock interrupts, disk and other interrupts]

SMP-based Implementation (cont’d)
[Figure: the Application / Host OS side enqueues a send request on the SHARED QUEUE; the TCP Server dequeues and executes the send request.]
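The enqueue/dequeue interaction on the shared queue can be sketched in a few lines. This is only an illustration under stated assumptions: the prototype uses a shared-memory queue inside the kernel, whereas the sketch below uses an ordinary in-process Python queue, and the names `SharedQueue`, `host_send` and `tcp_server_poll` are hypothetical.

```python
from collections import deque


class SharedQueue:
    """Bounded FIFO standing in for the shared-memory request queue (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def enqueue(self, request):
        """Called by the host side; returns False instead of blocking when full."""
        if len(self._items) >= self.capacity:
            return False
        self._items.append(request)
        return True

    def dequeue(self):
        """Called by the TCP server side; returns None when the queue is empty."""
        if not self._items:
            return None
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


def host_send(queue, connection_id, payload):
    # The host only describes the send; it does not run TCP/IP itself.
    return queue.enqueue(("send", connection_id, payload))


def tcp_server_poll(queue, execute):
    # Drain all pending requests and execute each one on the TCP server side.
    handled = 0
    while (request := queue.dequeue()) is not None:
        execute(request)
        handled += 1
    return handled
```

A full queue is reported to the caller rather than hidden, which mirrors the idea that queue overflow is an exceptional event worth observing.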

TCP Server Event-Driven Architecture
[Figure components: Dispatcher, Monitor, Send Handler, Receive Handler, Asynchronous Event Handler, Shared Queue (from and to the application processors), NIC]

Dispatcher
- Kernel thread executing at the highest priority level in the kernel
- Schedules the different handlers using input from the monitor
- Executes an infinite loop and does not yield the processor
- No other activity can execute on the TCP Server processor

Asynchronous Event Handler (AEH)
- Handles asynchronous network events
- Interacts with the NIC
- Can be an Interrupt Service Routine or a Polling Routine
- Is a short-running thread
- Has the highest priority among TCP server modules
- The clock interrupt is used as a guaranteed trigger for the AEH when polling

Send and Receive Handlers
- Scheduled in response to a request in the Shared Memory queues
- Run at the priority of the network protocol
- Interact with the Host processors
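A rough sketch of how a dispatcher might pick among the handler modules is given below. It assumes a fixed priority order with the AEH first, as the slides state; the relative order of the receive and send handlers, the `Dispatcher` class and its method names are assumptions made for illustration, and the Monitor's hints are left out.

```python
from collections import deque

# Hypothetical priority order: the AEH comes first, as the slides state;
# the relative order of the receive and send handlers is an assumption.
PRIORITY = ("aeh", "receive", "send")


class Dispatcher:
    def __init__(self, handlers):
        # handlers maps a module name to a callable taking one event.
        self.handlers = handlers
        self.pending = {name: deque() for name in PRIORITY}

    def post(self, name, event):
        self.pending[name].append(event)

    def dispatch_once(self):
        """Run the highest-priority handler that has pending work.

        Returns the name of the handler that ran, or None if idle.
        """
        for name in PRIORITY:
            if self.pending[name]:
                event = self.pending[name].popleft()
                self.handlers[name](event)
                return name
        return None

    def run(self, max_iterations):
        # The real dispatcher loops forever without yielding; the bound here
        # exists only so the sketch terminates.
        ran = []
        for _ in range(max_iterations):
            name = self.dispatch_once()
            if name is not None:
                ran.append(name)
        return ran
```

Because `dispatch_once` rescans from the top of the priority list on every iteration, a newly posted AEH event is always served before older send or receive work.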

Monitor
- Observes the state of the system queues and provides scheduling hints to the Dispatcher
- Used for book-keeping and dynamic load balancing
- Scheduled periodically or when an exception occurs
  - Queue overflow or empty
  - Bad checksum for a network packet
  - Retransmissions on a connection
- Can be used to reconfigure the set of TCP servers in response to load variation

TCP Servers for Cluster-based Servers

Cluster-based Implementation
[Figure: on the Host, the Application calls a Socket Stub, which tunnels the socket request over VI Channels; the TCP Server dequeues and executes the socket request.]

TCP Server Architecture
[Figure components: Eager Processor, Resource Manager, TCP/IP Provider, VI Connection Handler, Request Handler, Socket Call Processor, SAN (to host), NIC (to WAN)]

Sockets and VI Channels
- Pool of VIs created at initialization
  - Avoids the cost of creating VIs in the critical path
- Registered memory regions associated with each VI
  - Send and receive buffers associated with socket
  - Also used to exchange control data
- Socket mapped to a VI on the first socket operation
  - All subsequent operations on the socket tunneled through the same VI to the TCP server
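The socket-to-VI mapping can be illustrated with a small pool that binds a channel lazily on the first operation. This is a sketch, not the prototype's code: VIs are represented by plain integers, no memory registration is modeled, and the `VIPool` class and its `release` method are hypothetical additions.

```python
class VIPool:
    """Pre-created pool of VI channels (represented here by integer ids)."""

    def __init__(self, size):
        # All VIs are created up front, off the critical path.
        self._free = list(range(size))
        self._by_socket = {}

    def channel_for(self, socket_id):
        """Return the VI bound to socket_id, binding one on first use."""
        vi = self._by_socket.get(socket_id)
        if vi is None:
            if not self._free:
                raise RuntimeError("no free VI channel in the pool")
            vi = self._free.pop(0)
            self._by_socket[socket_id] = vi
        return vi

    def release(self, socket_id):
        """Return the socket's VI to the pool (assumed to happen on close)."""
        vi = self._by_socket.pop(socket_id, None)
        if vi is not None:
            self._free.append(vi)
```

Every later call to `channel_for` with the same socket returns the same VI, which is the property that lets all operations on a socket travel through one channel.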

Socket Call Processing
- Host library intercepts socket call
- Socket call parameters are tunneled to the TCP server over a VI channel
- TCP server performs socket operation and returns results to the host
- Library returns control to the application immediately or when the socket call completes (asynchronous vs synchronous processing)
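The difference between synchronous and asynchronous socket calls can be sketched as follows. A worker thread stands in for the remote TCP server, and the class names `TcpServerStub` and `HostSocketLibrary`, as well as the single `send` operation, are hypothetical; the real system tunnels calls over VI channels rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


class TcpServerStub:
    """Stands in for the remote TCP server; runs socket operations on a worker thread."""

    def __init__(self):
        self._worker = ThreadPoolExecutor(max_workers=1)

    def submit(self, op, *args):
        return self._worker.submit(self._perform, op, *args)

    @staticmethod
    def _perform(op, *args):
        # Hypothetical operation: "send" reports the number of bytes sent.
        if op == "send":
            (data,) = args
            return len(data)
        raise ValueError(f"unsupported socket operation: {op}")

    def close(self):
        self._worker.shutdown(wait=True)


class HostSocketLibrary:
    def __init__(self, server):
        self._server = server

    def send(self, data):
        """Synchronous: return only after the TCP server reports the result."""
        return self._server.submit("send", data).result()

    def async_send(self, data):
        """Asynchronous: return a Future at once; the caller collects the result later."""
        return self._server.submit("send", data)
```

With `async_send`, the application can overlap its own work with the socket operation and check the returned future only when it needs the outcome.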

Design Issues for TCP Servers
- Splitting of the TCP/IP processing
  - Where to split?
- Asynchronous event handling
  - Interrupt or polling?
- Asynchronous API
- Event scheduling and resource allocation
- Adaptation to different workloads

Prototypes and Evaluation

SMP-based Prototype
- Modified the Linux 2.4.9 SMP kernel on the Intel x86 platform to implement the TCP server
- Most parts of the system are kernel modules, with small inline changes to the TCP stack, software interrupt handlers and the task structures
- Instrumented the kernel using on-chip performance monitoring counters to profile the system

Evaluation Testbed
- Server
  - NIC: 3-Com 996-BT Gigabit Ethernet
  - 4-way 550 MHz Intel Pentium II Xeon system with 1 GB DRAM and 1 MB on-chip L2 cache
- Clients
  - 4-way SMPs
  - 2-way 300 MHz Intel Pentium II system with 512 MB RAM and 256 KB on-chip L2 cache
  - NIC: 3-Com 996-BT Gigabit Ethernet
- Server application: Apache 1.3.20 web server
- Client program: sclients [Banga 97]
  - Trace-driven execution of clients

Trace Characteristics
Logs       Number of files  Average file size  Number of requests  Average reply size
Forth      11931            19.3 KB            400335              8.8 KB
Rutgers    18370            27.3 KB            498646              19.0 KB
Synthetic  128              16.0 KB            50000

Splitting TCP/IP Processing
APPLICATION PROCESSORS: APPLICATION, SYSTEM CALLS
- SEND: copy_from_application_buffers, TCP_send, IP_send, packet_scheduler, setup_DMA, packet_out
- RECEIVE: copy_to_application_buffers, TCP_receive, IP_receive, software_interrupt_handler, interrupt_handler, packet_in
DEDICATED PROCESSORS
Split points marked in the figure: C1, C2, C3

Implementations
Implementation  Interrupt processing (C1)  Receive Bottom (C2)  Send Bottom (C3)  Avoiding Interrupts (S1)
SMP_BASE        -                          -                    -                 -
SMP_C1C2        x                          x                    -                 -
SMP_C1C2S1      x                          x                    -                 x
SMP_C1C2C3      x                          x                    x                 -
SMP_C1C2C3S1    x                          x                    x                 x

Throughput

CPU Utilization for Synthetic Trace

Throughput Using Synthetic Trace With Dynamic Content

Adapting TCP Servers to Changing Workloads
- Monitor the queues
  - Identify low and high water marks to change the size of the processor set
  - Execute a special handler for exceptional events
- Queue length lower than the low water mark
  - Set a flag which dispatcher checks
  - Dispatcher sleeps if the flag is set
  - Reroute the interrupts
- Queue length higher than the high water mark
  - Wake up the dispatcher on the chosen processor
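The water-mark policy amounts to hysteresis on the observed queue length. The sketch below assumes the processor set changes by one processor per observation and is bounded by a minimum and maximum size; those bounds, the step size, and the `QueueMonitor` name are assumptions, and interrupt rerouting is not modeled.

```python
class QueueMonitor:
    """Hysteresis on queue length that grows or shrinks the TCP server processor set."""

    def __init__(self, low_mark, high_mark, min_procs, max_procs, procs):
        if low_mark >= high_mark:
            raise ValueError("low_mark must be below high_mark")
        self.low_mark = low_mark
        self.high_mark = high_mark
        self.min_procs = min_procs
        self.max_procs = max_procs
        self.procs = procs

    def observe(self, queue_length):
        """Return 'sleep', 'wake', or None for one sampled queue length."""
        if queue_length < self.low_mark and self.procs > self.min_procs:
            # Below the low water mark: flag one dispatcher to sleep.
            self.procs -= 1
            return "sleep"
        if queue_length > self.high_mark and self.procs < self.max_procs:
            # Above the high water mark: wake a dispatcher on another processor.
            self.procs += 1
            return "wake"
        return None
```

Queue lengths between the two marks cause no change, which keeps the processor set from oscillating when the load hovers around a single threshold.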

Load behaviour and dynamic reconfiguration

Throughput with Dynamic Reconfiguration

Cluster-based Prototype
- User-space implementation (bypass host kernel)
- Entire socket operation offloaded to TCP Server
  - C1, C2 and C3 offloaded by default
- Optimizations
  - Asynchronous processing: AsyncSend
  - Processing ahead: Eager Receive, Eager Accept
  - Avoiding data copy at host using pre-registered buffers
    - requires different API: MemNet

Implementations
Implementation          Kernel Bypassing (H1)  Asynchronous Processing (H2)  Avoiding Host Copies (H3)  Processing Ahead (S2)
Cluster_base            -                      -                             -                          -
Cluster_C1C2C3H1        x                      -                             -                          -
Cluster_C1C2C3H1H3      x                      -                             x                          -
Cluster_C1C2C3H1H2H3    x                      x                             x                          -
Cluster_C1C2C3H1H2H3S2  x                      x                             x                          x

Evaluation Testbed
- Server
  - NIC: 3-Com 996-BT Gigabit Ethernet
  - Host and TCP Server: 2-way 300 MHz Intel Pentium II system with 512 MB RAM and 256 KB on-chip L2 cache
- Clients
  - 4-way 550 MHz Intel Pentium II Xeon system with 1 GB DRAM and 1 MB on-chip L2 cache
  - NIC: 3-Com 996-BT Gigabit Ethernet
- Server application: Custom web server
  - Flexibility in modifying application to use our API
- Client program: httperf

Throughput with Synthetic Trace Using HTTP/1.0

CPU Utilization

Throughput with Synthetic Trace Using HTTP/1.1

Throughput with Real Trace (Forth) Using HTTP/1.0

Related Work
- TCP Offloading Engines
- Communication Services Platform (CSP)
  - System architecture for scalable cluster-based servers, using a VIA-based SAN to tunnel TCP/IP packets inside the cluster
- Piglet - A vertical OS for multiprocessors
- Queue Pair IP - A new end point mechanism for inter-network processing inspired by memory-to-memory communication

Conclusions
- Offloading networking functionality to a set of dedicated TCP servers yields up to 30% performance improvement
- Performance Essentials:
  - TCP Server architecture
    - event driven
    - polling instead of interrupts
    - adaptive to load
  - API
    - asynchronous, zero-copy

Future Work
- TCP Server software distributions
- Compare TCP Server Architecture with hardware-based offloading schemes
- Use TCP Servers in Storage Networking

Acknowledgements
My graduate students: Murali Rangarajan, Aniruddha Bohra and Kalpana Banerjee
http://discolab.rutgers.edu