Slide 1: TCP Servers: Offloading TCP/IP Processing in Internet Servers
Liviu Iftode, Department of Computer Science, University of Maryland and Rutgers University
Slide 2: My Research: Network-Centric Systems
- TCP Servers and Split-OS [NSF CAREER]
- Migratory TCP and Service Continuations
- Federated File Systems
- Smart Messages [NSF ITR-2] and Spatial Programming for Networks of Embedded Systems
Slide 3: Networking and Performance
[Diagram: clients reach Internet servers over a TCP/IP WAN; the servers reach storage over a SAN. IP or not IP? TCP or not TCP?]
The transport-layer protocol must be efficient.
Slide 4: The Scalability Problem
Apache web server on 1-way and 2-way 300 MHz Intel Pentium II SMPs, with clients repeatedly accessing a static 16 KB file.
Slide 5: Breakdown of CPU Time for Apache
Slide 6: The TCP/IP Stack
SEND path (application to kernel): system call, copy_from_application_buffers, TCP_send, IP_send, packet_scheduler, setup_DMA, packet_out.
RECEIVE path (kernel to application): packet_in, hardware_interrupt_handler, software_interrupt_handler, IP_receive, TCP_receive, copy_to_application_buffers.
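The SEND path on this slide can be sketched as a chain of stages, each layer handing the payload down with its own header. A minimal userspace model in Python (the function names mirror the slide's labels; the header sizes and byte-string "headers" are illustrative, not the kernel's actual data structures):

```python
# Userspace model of the kernel SEND path: each stage hands the
# payload to the next layer, prepending a (fake) protocol header.

TCP_HDR, IP_HDR = 20, 20  # typical header sizes in bytes, no options

def copy_from_application_buffers(app_buf: bytes) -> bytes:
    # syscall boundary: payload is copied out of the application's buffer
    return bytes(app_buf)

def tcp_send(payload: bytes) -> bytes:
    return b"T" * TCP_HDR + payload   # prepend a stand-in TCP header

def ip_send(segment: bytes) -> bytes:
    return b"I" * IP_HDR + segment    # prepend a stand-in IP header

def packet_scheduler(packet: bytes) -> bytes:
    return packet                     # queueing discipline (no-op here)

def setup_dma(packet: bytes) -> bytes:
    return packet                     # hand the descriptor to the NIC

def send(app_buf: bytes) -> bytes:
    """Full path: application buffer -> wire-ready packet."""
    data = copy_from_application_buffers(app_buf)
    return setup_dma(packet_scheduler(ip_send(tcp_send(data))))

packet = send(b"x" * 1000)
print(len(packet))  # → 1040 (1000 bytes payload + 40 bytes of headers)
```

The point of the model is that every stage runs on the same CPU that the application needs, which is what the later slides propose to offload.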
Slide 7: Breakdown of CPU Time for Apache
Slide 8: Serialized Networking Actions
[Diagram: the same SEND path (copy_from_application_buffers, TCP_send, IP_send, packet_scheduler, setup_DMA, packet_out) and RECEIVE path (packet_in, hardware_interrupt_handler, software_interrupt_handler, IP_receive, TCP_receive, copy_to_application_buffers), with the operations marked as serialized.]
Slide 9: TCP/IP Processing Is Very Expensive
- Protocol processing can take up to 70% of the CPU cycles for the Apache web server on uniprocessors [Hu 97]
- Can lead to receive livelock [Mogul 95]
- Interrupt handling consumes a significant amount of time; Soft Timers [Aron 99]
- Serialization affects scalability
Slide 10: Outline
- Motivation
- TCP offloading using TCP Servers
- TCP Servers for SMP servers
- TCP Servers for cluster-based servers
- Prototype evaluation
Slide 11: TCP Offloading Approach
- Offload network processing from application hosts to dedicated processors, nodes, or intelligent NICs
- Reduce OS intrusion: network interrupt handling, context switches, serialization in the networking stack, cache and TLB pollution
- Should adapt to changing load conditions
- Software or hardware solution?
Slide 12: The TCP Server Idea
[Diagram: a conventional server runs the application, OS, and TCP/IP on the host processor. With a TCP Server, TCP/IP processing moves to a dedicated TCP Server that talks to the host processor over fast communication; the client's connection terminates at the TCP Server.]
Slide 13: TCP Server Performance Factors
- Efficiency of the TCP Server implementation: event-based server, no interrupts
- Efficiency of communication between host(s) and TCP Server: non-intrusive, low-overhead API; asynchronous, zero-copy
- Adaptiveness to load
Slide 14: TCP Servers for Multiprocessor Systems
[Diagram: an SMP server in which one CPU runs the TCP Server while the remaining CPUs run the application and host OS, communicating through shared memory.]
Slide 15: TCP Servers for Clusters with Memory-to-Memory Interconnects
[Diagram: a cluster-based server in which a host node runs the application and a separate node runs the TCP Server, connected by a memory-to-memory interconnect.]
Slide 16: TCP Servers for Multiprocessor Servers
Slide 17: SMP-based Implementation
[Diagram: the IO APIC routes network and clock interrupts to the TCP Server processor, while disk and other interrupts go to the processors running the application and host OS.]
Slide 18: SMP-based Implementation (cont'd)
[Diagram: the application enqueues a send request into a shared queue; the TCP Server dequeues and executes it.]
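The enqueue/dequeue handoff on this slide is a producer/consumer queue between the application CPUs and the TCP Server CPU. A minimal single-threaded Python model (the request format, capacity, and helper names are assumptions for illustration; the real queue lives in shared memory between processors):

```python
from collections import deque

# Model of the shared request queue: application CPUs produce
# requests, the TCP Server CPU consumes and executes them.

class SharedQueue:
    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.q = deque()

    def enqueue(self, req) -> bool:
        if len(self.q) >= self.capacity:
            return False          # overflow: an event the Monitor would handle
        self.q.append(req)
        return True

    def dequeue(self):
        return self.q.popleft() if self.q else None

def host_send(queue, sock, buf):
    # host side: post the request and return immediately (asynchronous)
    return queue.enqueue({"op": "send", "sock": sock, "buf": buf})

def tcp_server_loop(queue, completed):
    # TCP Server side: drain and execute pending requests
    while (req := queue.dequeue()) is not None:
        completed.append(req["op"])

q = SharedQueue()
done = []
host_send(q, sock=3, buf=b"GET / HTTP/1.0\r\n")
tcp_server_loop(q, done)
print(done)  # → ['send']
```

The key property the slide relies on is that the host never blocks on TCP processing: it posts a request and continues, while the dedicated processor drains the queue.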
Slide 19: TCP Server Event-Driven Architecture
Components: Dispatcher, Monitor, Send Handler, Receive Handler, Asynchronous Event Handler. The shared queue connects the TCP Server to the application processors; the NIC connects it to the network.
Slide 20: Dispatcher
- Kernel thread executing at the highest priority level in the kernel
- Schedules the different handlers based on input from the Monitor
- Executes an infinite loop and does not yield the processor
- No other activity can execute on the TCP Server processor
Slide 21: Asynchronous Event Handler (AEH)
- Handles asynchronous network events; interacts with the NIC
- Can be an interrupt service routine or a polling routine
- Is a short-running thread
- Has the highest priority among TCP Server modules
- When polling, the clock interrupt is used as a guaranteed trigger for the AEH
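The polling mode described above can be modeled as a loop driven by clock ticks: instead of taking an interrupt per packet, the AEH drains the NIC on each tick. A small Python sketch (the FakeNIC class, tick granularity, and arrival times are invented for illustration):

```python
# Model of the AEH in polling mode: the periodic clock tick
# guarantees the handler runs even when no interrupt fires.

class FakeNIC:
    def __init__(self, arrivals):
        self.arrivals = sorted(arrivals)  # packet arrival times, in ticks

    def poll(self, now):
        ready = [t for t in self.arrivals if t <= now]
        self.arrivals = [t for t in self.arrivals if t > now]
        return ready

def aeh_run(nic, now, delivered):
    # short-running handler: drain whatever the NIC has, then return
    delivered.extend(nic.poll(now))

def dispatcher(nic, ticks):
    delivered = []
    for now in range(ticks):
        aeh_run(nic, now, delivered)  # clock tick triggers the AEH
        # ... the real dispatcher would schedule send/receive handlers here ...
    return delivered

pkts = dispatcher(FakeNIC([0, 2, 2, 5]), ticks=6)
print(pkts)  # → [0, 2, 2, 5]
```

This is the trade-off the slide names: polling bounds per-packet overhead (no interrupt storm, no receive livelock), and the clock tick caps how long a packet can wait before the AEH notices it.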
Slide 22: Send and Receive Handlers
- Scheduled in response to a request in the shared-memory queues
- Run at the priority of the network protocol
- Interact with the host processors
Slide 23: Monitor
- Observes the state of the system queues and provides hints the Dispatcher uses for scheduling
- Used for bookkeeping and dynamic load balancing
- Scheduled periodically or when an exception occurs: queue overflow or empty, bad checksum for a network packet, retransmissions on a connection
- Can be used to reconfigure the set of TCP Servers in response to load variation
Slide 24: TCP Servers for Cluster-based Servers
Slide 25: Cluster-based Implementation
[Diagram: the host's socket stub tunnels a socket request over VI channels to the TCP Server node, which dequeues and executes it.]
Slide 26: TCP Server Architecture
Components: Eager Processor, Resource Manager, TCP/IP Provider, VI Connection Handler, Request Handler, Socket Call Processor. The TCP Server node faces the host over the SAN NIC and the clients over the WAN.
Slide 27: Sockets and VI Channels
- A pool of VIs is created at initialization, avoiding the cost of creating VIs in the critical path
- Registered memory regions are associated with each VI: send and receive buffers associated with the socket, also used to exchange control data
- A socket is mapped to a VI on its first socket operation; all subsequent operations on that socket are tunneled through the same VI to the TCP Server
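The pool-plus-mapping scheme above can be sketched as follows. A minimal Python model (class and method names are assumptions; real VIs are hardware-backed endpoints with registered memory regions, not integers):

```python
# Model of the pre-created VI pool: VIs are allocated at init time so
# the critical path never creates one; a socket is bound to a VI on
# its first operation and reuses that VI for every later operation.

class VIPool:
    def __init__(self, size: int):
        self.free = list(range(size))   # VI ids created at initialization
        self.bound = {}                 # socket fd -> VI id

    def vi_for_socket(self, fd: int) -> int:
        if fd not in self.bound:        # first operation on this socket
            if not self.free:
                raise RuntimeError("VI pool exhausted")
            self.bound[fd] = self.free.pop()
        return self.bound[fd]           # all later ops reuse the same VI

pool = VIPool(size=4)
vi_a = pool.vi_for_socket(10)
vi_b = pool.vi_for_socket(11)
assert pool.vi_for_socket(10) == vi_a   # same socket, same VI
assert vi_a != vi_b                     # distinct sockets, distinct VIs
```

Binding lazily on first use, rather than at socket creation, keeps sockets that never do I/O from consuming a VI.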
Slide 28: Socket Call Processing
- The host library intercepts the socket call
- Socket call parameters are tunneled to the TCP Server over a VI channel
- The TCP Server performs the socket operation and returns results to the host
- The library returns control to the application immediately (asynchronous processing) or when the socket call completes (synchronous processing)
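The intercept-marshal-execute flow above can be sketched in a few lines. A Python model using JSON as a stand-in wire format and a plain list as the channel (the real system tunnels binary requests over VI channels; all names here are illustrative):

```python
import json

# Model of socket-call tunneling: the host library marshals the call
# into a message, ships it over the channel, and the TCP Server
# unmarshals and executes it, as the slide describes.

def marshal(op, fd, **params):
    return json.dumps({"op": op, "fd": fd, **params}).encode()

def tunnel_call(channel, op, fd, **params):
    channel.append(marshal(op, fd, **params))   # host -> TCP Server

def tcp_server_dispatch(channel):
    results = []
    for msg in channel:
        req = json.loads(msg)
        if req["op"] == "send":
            # "execute" the operation: here, just report bytes sent
            results.append(("send", req["fd"], len(req["data"])))
    channel.clear()
    return results

chan = []
tunnel_call(chan, "send", fd=5, data="hello")
print(tcp_server_dispatch(chan))  # → [('send', 5, 5)]
```

In the asynchronous mode the host returns right after `tunnel_call`; in the synchronous mode it would wait for the dispatch result before returning to the application.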
Slide 29: Design Issues for TCP Servers
- Splitting of the TCP/IP processing: where to split?
- Asynchronous event handling: interrupt or polling?
- Asynchronous API
- Event scheduling and resource allocation
- Adaptation to different workloads
Slide 30: Prototypes and Evaluation
Slide 31: SMP-based Prototype
- Modified the Linux SMP kernel on the Intel x86 platform to implement the TCP Server
- Most parts of the system are kernel modules, with small inline changes to the TCP stack, software interrupt handlers, and the task structures
- Instrumented the kernel using on-chip performance-monitoring counters to profile the system
Slide 32: Evaluation Testbed
- Server: 4-way 550 MHz Intel Pentium II Xeon system with 1 GB DRAM and 1 MB on-chip L2 cache
- Clients: 2-way 300 MHz Intel Pentium II systems with 512 MB RAM and 256 KB on-chip L2 cache
- NIC: 3Com 996-BT Gigabit Ethernet
- Server application: Apache web server
- Client program: sclients [Banga 97]; trace-driven execution of clients
Slide 33: Trace Characteristics

Log        Number of files   Average file size   Number of requests   Average reply size
Forth      11931             19.3 KB             400335               8.8 KB
Rutgers    18370             27.3 KB             498646               19.0 KB
Synthetic  128               16.0 KB             50000
Slide 34: Splitting TCP/IP Processing
[Diagram: the SEND and RECEIVE paths from the earlier slide, annotated with three candidate split points: C1 (interrupt processing), C2 (the lower part of the receive path), and C3 (the lower part of the send path). The application and system calls stay on the application processors; C1, C2, and C3 can move to dedicated processors.]
Slide 35: Implementations
The variants are named for which components are offloaded: C1 (interrupt processing), C2 (receive bottom), C3 (send bottom), and S1 (avoiding interrupts):
- SMP_BASE
- SMP_C1C2
- SMP_C1C2S1
- SMP_C1C2C3
- SMP_C1C2C3S1
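Since each variant's name encodes exactly which components it offloads, the naming scheme can be reconstructed mechanically. A small Python helper (purely illustrative; the deck does not describe such a generator):

```python
# Reconstruct variant names from a set of offloaded components:
# C1 = interrupt processing, C2 = receive bottom, C3 = send bottom,
# S1 = avoiding interrupts. Components print in a fixed order.

ORDER = ["C1", "C2", "C3", "S1"]

def variant_name(prefix: str, components: set) -> str:
    parts = [c for c in ORDER if c in components]
    return prefix + ("_" + "".join(parts) if parts else "_BASE")

print(variant_name("SMP", {"C1", "C2", "S1"}))  # → SMP_C1C2S1
```

Reading the names this way makes the experiment matrix explicit: each variant adds one more offloaded component over the previous one.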
Slide 36: Throughput
Slide 37: CPU Utilization for Synthetic Trace
Slide 38: Throughput Using Synthetic Trace with Dynamic Content
Slide 39: Adapting TCP Servers to Changing Workloads
- Monitor the queues; identify low and high water marks to change the size of the processor set
- Execute a special handler for exceptional events
- Queue length lower than the low water mark: set a flag that the dispatcher checks; the dispatcher sleeps if the flag is set; reroute the interrupts
- Queue length higher than the high water mark: wake up the dispatcher on the chosen processor
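The watermark logic above can be condensed into a single decision function. A Python sketch (the thresholds and processor limits are invented for illustration; the real system also sets the sleep flag, reroutes interrupts, and wakes dispatchers on specific processors):

```python
# Model of watermark-driven reconfiguration: the Monitor compares the
# shared-queue length against low/high water marks and grows or
# shrinks the set of TCP Server processors accordingly.

LOW_WATER, HIGH_WATER = 4, 32   # illustrative thresholds

def adapt(nprocs: int, qlen: int, min_procs: int = 1, max_procs: int = 4) -> int:
    if qlen > HIGH_WATER and nprocs < max_procs:
        return nprocs + 1   # overloaded: wake a dispatcher on another processor
    if qlen < LOW_WATER and nprocs > min_procs:
        return nprocs - 1   # idle: flag a dispatcher to sleep, reroute interrupts
    return nprocs           # between the watermarks: leave the set alone

print(adapt(1, qlen=40))  # → 2
```

Keeping a dead band between the two watermarks prevents the processor set from oscillating when the queue length hovers near a single threshold.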
Slide 40: Load Behaviour and Dynamic Reconfiguration
Slide 41: Throughput with Dynamic Reconfiguration
Slide 42: Cluster-based Prototype
- User-space implementation (bypasses the host kernel)
- The entire socket operation is offloaded to the TCP Server; C1, C2 and C3 are offloaded by default
- Optimizations: asynchronous processing (AsyncSend); processing ahead (Eager Receive, Eager Accept); avoiding data copies at the host using pre-registered buffers, which requires a different API (MemNet)
Slide 43: Implementations
The variants add host-side optimizations: H1 (kernel bypassing), H2 (asynchronous processing), H3 (avoiding host copies), and S2 (processing ahead):
- Cluster_base
- Cluster_C1C2C3H1
- Cluster_C1C2C3H1H3
- Cluster_C1C2C3H1H2H3
- Cluster_C1C2C3H1H2H3S2
Slide 44: Evaluation Testbed
- Host and TCP Server: 2-way 300 MHz Intel Pentium II systems with 512 MB RAM and 256 KB on-chip L2 cache
- Clients: 4-way 550 MHz Intel Pentium II Xeon systems with 1 GB DRAM and 1 MB on-chip L2 cache
- NIC: 3Com 996-BT Gigabit Ethernet
- Server application: custom web server (flexibility in modifying the application to use our API)
- Client program: httperf
Slide 45: Throughput with Synthetic Trace Using HTTP/1.0
Slide 46: CPU Utilization
Slide 47: Throughput with Synthetic Trace Using HTTP/1.1
Slide 48: Throughput with Real Trace (Forth) Using HTTP/1.0
Slide 49: Related Work
- TCP Offloading Engines
- Communication Services Platform (CSP): a system architecture for scalable cluster-based servers, using a VIA-based SAN to tunnel TCP/IP packets inside the cluster
- Piglet: a vertical OS for multiprocessors
- Queue Pair IP: a new end-point mechanism for inter-network processing, inspired by memory-to-memory communication
Slide 50: Conclusions
- Offloading networking functionality to a set of dedicated TCP Servers yields up to 30% performance improvement
- Performance essentials:
  - TCP Server architecture: event-driven, polling instead of interrupts, adaptive to load
  - API: asynchronous, zero-copy
Slide 51: Future Work
- TCP Server software distributions
- Compare the TCP Server architecture with hardware-based offloading schemes
- Use TCP Servers in storage networking
Slide 52: Acknowledgements
My graduate students: Murali Rangarajan, Aniruddha Bohra and Kalpana Banerjee