Design and Implementation of A Content-aware Switch using A Network Processor. Li Zhao, Yan Luo, Laxmi Bhuyan (University of California, Riverside); Ravi Iyer (Intel Corporation)

Presentation transcript:

1 Design and Implementation of A Content-aware Switch using A Network Processor. Li Zhao, Yan Luo, Laxmi Bhuyan (University of California, Riverside); Ravi Iyer (Intel Corporation)

2 Outline Motivation Background Design and Implementation Measurement Results Conclusions

3 Content-aware Switch
[Figure: a content-aware switch sits between the Internet and a cluster of image, application, and HTML servers; an incoming request such as "GET /cgi-bin/form HTTP/1.1, Host: ..." arrives as APP. DATA / TCP / IP]
Front-end of a web cluster, exposing one virtual IP address (VIP)
Routes packets based on layer-5 information: examines application data in addition to the IP & TCP headers (a minimal dispatch sketch follows below)
Advantages over layer-4 switches:
– Better load balancing: distribute requests based on content type
– Faster response: exploit cache affinity
– Better resource utilization: partition the database
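The dispatch rule behind layer-5 routing can be made concrete with a small sketch. This is not the paper's code: the server pools (IMAGE_SERVER, APP_SERVER, HTML_SERVER) and the prefix/extension rules are hypothetical, and a real switch would also weigh load and cache affinity.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical back-end pools; the real switch keeps this mapping in its URL table. */
enum backend { IMAGE_SERVER, APP_SERVER, HTML_SERVER };

/* Pick a back-end from the URL in an HTTP request line such as
 * "GET /cgi-bin/form HTTP/1.1". Content-type based dispatch only. */
static enum backend select_backend(const char *request_line)
{
    const char *url = strchr(request_line, ' ');
    if (url == NULL)
        return HTML_SERVER;                 /* malformed request: default pool */
    url++;                                  /* skip the space after the method */

    if (strncmp(url, "/cgi-bin/", 9) == 0)
        return APP_SERVER;                  /* dynamic content */
    if (strstr(url, ".jpg") || strstr(url, ".gif") || strstr(url, ".png"))
        return IMAGE_SERVER;                /* static images */
    return HTML_SERVER;                     /* everything else */
}

int main(void)
{
    printf("%d\n", select_backend("GET /cgi-bin/form HTTP/1.1"));  /* prints 1 (APP_SERVER) */
    return 0;
}
```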

4 Processing Elements in Content-aware Switches
ASIC (Application-Specific Integrated Circuit): high processing capacity, but long development time and little flexibility
GP (General-purpose Processor): programmable, but cannot provide satisfactory performance because of interrupt overhead, the cost of moving packets across the PCI bus, and an ISA not optimized for networking applications
NP (Network Processor): operates at the link layer with an ISA optimized for packet processing, plus multiprocessing and multithreading, giving high performance; programmable, so it retains flexibility

5 Outline Motivation Background NP architecture Mechanism to build a content-aware switch Design and Implementation Measurement Results Conclusion

6 Background on NP
Hardware
– Control processor (CP): embedded general-purpose processor that maintains control information
– Data processors (DPs): tuned specifically for packet processing
– CP and DPs communicate through shared DRAM
NP operation: a packet arrives in the receive buffer, a DP performs header processing, and the packet is transferred to the transmit buffer (a minimal sketch of this loop follows below)
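A minimal sketch of the receive / header-processing / transmit flow, written as ordinary C with toy in-memory rings. The ring and packet types are invented for illustration; a real IXP2400 data processor (microengine) uses hardware receive/transmit buffers and many threads, not this single loop.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for the receive and transmit buffers shared with the MAC. */
struct packet { uint8_t data[64]; uint16_t len; };

#define RING_SIZE 8
struct ring { struct packet slot[RING_SIZE]; unsigned head, tail; };

static bool ring_pop(struct ring *r, struct packet *out) {
    if (r->head == r->tail) return false;            /* empty */
    *out = r->slot[r->head % RING_SIZE]; r->head++; return true;
}
static void ring_push(struct ring *r, const struct packet *in) {
    r->slot[r->tail % RING_SIZE] = *in; r->tail++;
}

/* Header-processing placeholder: here it only flips a byte that stands in for
 * a rewritten TCP header field (e.g., a spliced sequence number). */
static void process_headers(struct packet *p) { p->data[0] ^= 0xFF; }

int main(void) {
    struct ring rx = {0}, tx = {0};
    struct packet p = { .data = {0x12}, .len = 1 };
    ring_push(&rx, &p);                               /* packet arrives in receive buffer */

    struct packet cur;
    while (ring_pop(&rx, &cur)) {                     /* DP loop: receive ...             */
        process_headers(&cur);                        /* ... header processing ...        */
        ring_push(&tx, &cur);                         /* ... transfer to transmit buffer  */
    }
    printf("tx has %u packet(s)\n", tx.tail - tx.head);
    return 0;
}
```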

7 Mechanisms to Build a Content-aware Switch
TCP gateway: an application-level proxy
– Sets up a first connection with the client, parses the request, selects a server, and sets up a second connection with the server
– Incurs copy overhead: data crosses the kernel/user boundary in both directions (a minimal sketch of this relay path follows below)
TCP splicing: reduces the copy overhead
– Packets are forwarded at the network level, between the network interface driver and the TCP/IP stack
– The two connections are spliced together by modifying fields in the IP and TCP headers
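To make the gateway's copy overhead concrete, here is a minimal user-space relay loop of the kind a TCP gateway would run after setting up both connections. It is an illustrative sketch, not the paper's code; every response byte is read into a user-space buffer and written back out, i.e., it crosses the kernel/user boundary twice per direction, which is exactly the overhead splicing removes.

```c
#include <sys/types.h>
#include <unistd.h>

/* Relay the server's response to the client through a user-space buffer.
 * Socket setup, request parsing, and error handling are omitted. */
static void relay_response(int server_fd, int client_fd)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(server_fd, buf, sizeof(buf))) > 0) {   /* kernel -> user copy */
        if (write(client_fd, buf, (size_t)n) != n)           /* user -> kernel copy */
            break;
    }
}
```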

8 TCP Splicing
Step 1: client → switch: SYN(CSEQ)
Step 2: switch → client: SYN(DSEQ), ACK(CSEQ+1)
Step 3: client → switch: DATA(CSEQ+1) [HTTP request], ACK(DSEQ+1)
Step 4: switch → server: SYN(CSEQ)
Step 5: server → switch: SYN(SSEQ), ACK(CSEQ+1)
Step 6: switch → server: DATA(CSEQ+1), ACK(SSEQ+1)
Step 7: server → switch: DATA(SSEQ+1), ACK(CSEQ+lenR+1); forwarded to the client as DATA(DSEQ+1), ACK(CSEQ+lenR+1)
Step 8: client → switch: ACK(DSEQ+lenD+1); forwarded to the server as ACK(SSEQ+lenD+1)
lenR: size of the HTTP request. lenD: size of the returned document.
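The splicing itself reduces to translating sequence/acknowledgement numbers between the two connections, sketched below in C. The struct layout and function names are assumptions for illustration; CSEQ, DSEQ, and SSEQ follow the diagram above.

```c
#include <stdint.h>

/* Per-connection splicing state kept by the switch (illustrative fields only). */
struct splice_state {
    uint32_t dseq;   /* ISN the switch chose toward the client */
    uint32_t sseq;   /* server's ISN                           */
    /* addresses/ports to rewrite are omitted for brevity      */
};

struct tcp_hdr { uint32_t seq; uint32_t ack; /* other fields omitted */ };

/* Server -> client: the server numbers its bytes from SSEQ, but the client
 * expects them numbered from DSEQ, so shift by (DSEQ - SSEQ). */
void rewrite_server_to_client(struct tcp_hdr *h, const struct splice_state *s)
{
    h->seq += s->dseq - s->sseq;   /* unsigned wrap-around gives correct modulo-2^32 math */
    /* the ack field (acking client bytes numbered from CSEQ) needs no change   */
    /* IP and TCP checksums must be updated after any rewrite (not shown)       */
}

/* Client -> server: the client acks bytes numbered from DSEQ; convert to SSEQ. */
void rewrite_client_to_server(struct tcp_hdr *h, const struct splice_state *s)
{
    h->ack += s->sseq - s->dseq;
    /* the seq field (client bytes numbered from CSEQ) needs no change          */
}
```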

9 Outline Motivation Background Design and Implementation Discussion on design options Resource allocation Processing on MEs Measurement Results Conclusion

10 Design Options
Option 0: GP-based (Linux-based) switch
Option 1: the CP sets up and splices connections; the DPs process packets sent after splicing
– Connection setup and splicing are more complex than data forwarding
– Packets arriving before splicing must pass through DRAM queues to reach the CP
Option 2: the DPs handle connection setup, splicing, and forwarding

11 IXP2400 Block Diagram
XScale core
Microengines (MEs): 2 clusters of 4 microengines each; each ME runs up to 8 threads and has a 16KB instruction store and local memory
Scratchpad memory, SRAM & DRAM controllers
[Figure: block diagram showing the MEs, scratchpad, hash unit, CSRs, IX bus interface, SRAM controller, XScale core, SDRAM controller, and PCI interface]

12 Resource Allocation
Client-side control block list: records the state of connections between clients and the switch, plus the state needed to forward data packets after splicing
Server-side control block list: records the state of connections between the switch and servers
URL table: selects a back-end server for an incoming request (a sketch of these structures follows below)
[Figure: control block lists associated with the client port and the server ports]
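A sketch of what these three structures might look like. Field names and sizes are assumptions for illustration, not the paper's actual layout.

```c
#include <stdint.h>

/* One entry per client<->switch connection; after splicing it also carries the
 * information needed to rewrite forwarded data packets (see the splicing sketch). */
struct client_cb {
    uint32_t client_ip;
    uint16_t client_port;
    uint32_t cseq;            /* client's initial sequence number            */
    uint32_t dseq;            /* ISN the switch used toward the client       */
    uint32_t sseq;            /* server's ISN, filled in once spliced        */
    uint8_t  state;           /* e.g., SYN_RCVD, REQ_PARSED, SPLICED         */
    uint16_t server_id;       /* back-end chosen from the URL table          */
    struct client_cb *next;   /* hash-chain link                             */
};

/* One entry per switch<->server connection. */
struct server_cb {
    uint32_t server_ip;
    uint16_t server_port;
    uint32_t sseq;
    uint8_t  state;
    struct server_cb *next;
};

/* URL table entry: map a URL prefix (or content type) to a back-end server. */
struct url_entry {
    char     prefix[64];      /* e.g., "/cgi-bin/" or "/images/" */
    uint32_t server_ip;
    uint16_t server_port;
};
```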

13 Processing on MEs
Control packets: SYN, HTTP request
Data packets: response, ACK
(A sketch of the per-packet classification follows below.)
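One way the per-packet classification might look in C; the type names, flag constant, and dispatch below are illustrative assumptions rather than the actual microengine code.

```c
#include <stdint.h>
#include <stdbool.h>

#define TCP_FLAG_SYN 0x02

enum pkt_class { PKT_SYN, PKT_HTTP_REQUEST, PKT_RESPONSE_DATA, PKT_ACK };

struct pkt_meta {
    uint8_t  tcp_flags;
    uint16_t payload_len;
    bool     from_client;     /* arrived on the client port? */
};

/* Control packets (SYN, HTTP request) drive connection setup, URL parsing, and
 * splicing; data packets (server response, client ACK) are only rewritten and
 * forwarded. */
enum pkt_class classify(const struct pkt_meta *p)
{
    if (p->tcp_flags & TCP_FLAG_SYN)
        return PKT_SYN;                    /* start a client-side connection     */
    if (p->from_client && p->payload_len > 0)
        return PKT_HTTP_REQUEST;           /* parse URL, select server, splice   */
    if (!p->from_client && p->payload_len > 0)
        return PKT_RESPONSE_DATA;          /* rewrite seq number, send to client */
    return PKT_ACK;                        /* rewrite ack number, send to server */
}
```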

14 Outline Motivation Background Design and Implementation Measurement Results Conclusion

15 Experimental Setup
NP-based switch: Radisys ENP2611 board containing an IXP2400
– XScale & MEs: 600MHz; 8MB SRAM and 128MB DRAM
– Three 1Gbps Ethernet ports: 1 client port and 2 server ports
Server: Apache web server on a 3.0GHz Intel Xeon processor
Client: Httperf on a 2.5GHz Intel P4 processor
Linux-based switch: loadable kernel module on a 2.5GHz P4 with two 1Gbps Ethernet NICs

16 Measurement Results
Latency is reduced significantly: 83.3% (from 0.6 ms) for a 1KB file
The larger the file size, the higher the reduction (results also shown for a 1MB file)
[Figure: latency of the Linux-based vs. NP-based switch for 1KB and 1MB files]

17 Analysis – Three Factors
Packet arrival: Linux-based: the NIC raises an interrupt once a packet arrives. NP-based: polling.
NIC-to-memory copy: Linux-based: on a dual 3.0GHz Xeon with a 1Gbps Intel Pro/1000 (82544GC) NIC, about 3 us to copy a 64-byte packet by DMA. NP-based: no copy; packets are processed inside the NP without the two copies.
Protocol processing: Linux-based: OS overheads; processing a data packet in the splicing state takes 13.6 us. NP-based: optimized ISA; 6.5 us.
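Read together (a rough illustration, not a figure from the paper): the Linux path spends about 3 us on the DMA copy plus 13.6 us of OS processing per spliced data packet, against roughly 6.5 us of IXP processing, suggesting a 2-3x per-packet advantage even before interrupt overhead is counted, which is in line with the throughput gains on the next slide.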

18 Measurement Results
Throughput is increased significantly: 5.7x for a small (1KB) file, 2.2x for a large (1MB) file
Improvement is higher for small files: the latency reduction for control packets exceeds that for data packets, and control packets make up a larger portion of small-file transfers

19 An Alternative Implementation
Original placement: SRAM holds control blocks, hash tables, and locks
– SRAM can become a bottleneck when thousands of connections are processed simultaneously
– Its size limitation makes it impossible to maintain a large number of control blocks
Alternative placement: control blocks in DRAM; hash table and locks in SRAM
– Memory accesses are distributed more evenly between SRAM and DRAM and can be pipelined
– Increases the number of control blocks that can be supported
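A conceptual sketch of the alternative layout: small, hot hash buckets and locks in SRAM, with pointers to the larger control blocks kept in DRAM. The types and lookup routine are illustrative assumptions; placement is indicated only by comments, since real code would use IXP-specific memory regions.

```c
#include <stdint.h>
#include <stddef.h>

struct control_block {              /* resides in DRAM: large, many instances        */
    uint32_t client_ip;
    uint16_t client_port;
    uint32_t cseq, dseq, sseq;
    uint8_t  state;
};

struct hash_bucket {                /* resides in SRAM: small, hot, per-bucket lock   */
    uint32_t lock;                  /* simple lock word                               */
    uint32_t key;                   /* e.g., hash of the connection 4-tuple           */
    struct control_block *cb;       /* pointer into the DRAM control-block array      */
};

#define NUM_BUCKETS 4096

/* Lookup touches SRAM first; only a hit dereferences into DRAM, so the two
 * memory accesses can be pipelined across threads. Chaining is omitted. */
struct control_block *lookup(struct hash_bucket table[NUM_BUCKETS], uint32_t key)
{
    struct hash_bucket *b = &table[key % NUM_BUCKETS];
    return (b->key == key) ? b->cb : NULL;
}
```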

20 Measurement Results
Fixed request file of 64 KB while increasing the request rate
Throughput: 665.6 Mbps vs. … Mbps

21 Conclusions
Designed and implemented a content-aware switch using the IXP2400
Analyzed various implementation tradeoffs and compared the switch's performance with a Linux-based switch
Measurement results show that the NP-based switch improves performance significantly

22 Backups

23 TCP Splicing
Step 1: client → switch: SYN(CSEQ)
Step 2: switch → client: SYN(DSEQ), ACK(CSEQ+1)
Step 3: client → switch: DATA(CSEQ+1) [HTTP request], ACK(DSEQ+1)
Step 4: switch → server: SYN(CSEQ)
Step 5: server → switch: SYN(SSEQ), ACK(CSEQ+1)
Step 6: switch → server: DATA(CSEQ+1), ACK(SSEQ+1)
Step 7: server → switch: DATA(SSEQ+1), ACK(CSEQ+lenR+1); forwarded to the client as DATA(DSEQ+1), ACK(CSEQ+lenR+1)
Step 8: client → switch: ACK(DSEQ+lenD+1); forwarded to the server as ACK(SSEQ+lenD+1)
lenR: size of the HTTP request. lenD: size of the returned document.

24 TCP Handoff
Step 1: client → switch: SYN(CSEQ)
Step 2: switch → client: SYN(DSEQ), ACK(CSEQ+1)
Step 3: client → switch: DATA(CSEQ+1) [HTTP request], ACK(DSEQ+1)
Step 4: switch → server: Migrate(Data, CSEQ, DSEQ)
Step 5: server → client: DATA(DSEQ+1), ACK(CSEQ+lenR+1)
Step 6: client → switch → server: ACK(DSEQ+lenD+1), forwarded unchanged
Migrate the created TCP connection from the switch to the back-end server
– Create a TCP connection at the back-end without going through the TCP three-way handshake
– Retrieve the state of an established connection and destroy that connection without the normal message handshake required to close a TCP connection
Once the connection is handed off to the back-end server, the switch must forward packets from the client to the appropriate back-end server
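A sketch of what the handoff ("Migrate") message in step 4 might carry, with hypothetical field names. Because the back-end adopts DSEQ as its own sequence numbering, later packets can be forwarded without the header rewriting that splicing requires.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical handoff message from the switch to the chosen back-end server. */
struct migrate_msg {
    uint32_t client_ip;
    uint16_t client_port;
    uint32_t cseq;        /* client's ISN, so the server can ACK client data      */
    uint32_t dseq;        /* ISN the switch used toward the client; the back-end  */
                          /* adopts it, so later packets need no header rewriting */
    uint16_t data_len;    /* length of the buffered HTTP request copied below     */
    uint8_t  data[1460];  /* buffered request bytes (one MSS, for illustration)   */
};

/* Build the migrate message from state the switch already holds. Once the
 * back-end installs the connection (skipping the three-way handshake), the
 * switch simply forwards client packets to it unchanged. */
void build_migrate(struct migrate_msg *m,
                   uint32_t cip, uint16_t cport,
                   uint32_t cseq, uint32_t dseq,
                   const uint8_t *req, uint16_t req_len)
{
    if (req_len > sizeof(m->data))
        req_len = sizeof(m->data);          /* clamp to the buffer size */
    m->client_ip   = cip;
    m->client_port = cport;
    m->cseq        = cseq;
    m->dseq        = dseq;
    m->data_len    = req_len;
    memcpy(m->data, req, req_len);
}
```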