
Increasing Web Server Throughput with Network Interface Data Caching
October 9, 2002
Hyong-youb Kim, Vijay S. Pai, and Scott Rixner
Rice Computer Architecture Group

2  Anatomy of a Web Request
• Static content web server
[Figure: a request flows from the network through the network interface and the local interconnect to the CPU and main memory; the request, HTTP headers, and file data cross the interconnect repeatedly, driving it to 95% utilization]

3  Problem
• Inefficient use of the local interconnect
  – Repeated transfers of the same data
  – Every bit of data sent to the network must first cross the interconnect
• The local interconnect becomes the bottleneck
• Transfer overhead exacerbates the inefficiency
  – Overhead reduces the bandwidth available for data
  – E.g., the Peripheral Component Interconnect (PCI) bus loses about 30% of its bandwidth to transfer overhead

4  Solution
• Network interface data caching
  – Cache data in the network interface
  – Reduces interconnect traffic
  – Software-controlled cache
  – Minimal changes to the operating system
• Prototype web server
  – Up to 57% reduction in PCI traffic
  – Up to 31% increase in server performance
  – Peak of 1571 Mb/s of content throughput, breaking the PCI bottleneck

5  Outline
• Background
• Network Interface Data Caching
• Implementation
• Experimental Prototype / Results
• Summary

6  Network Interface Data Cache
• Software-controlled cache in the network interface
[Figure: the same request flow as slide 2, but on a cache hit the file data no longer crosses the interconnect; only the request and headers do]

7  Web Traces
• Five web traces with realistic working sets and file distributions:
  – Berkeley computer science department
  – IBM
  – NASA Kennedy Space Center
  – Rice computer science department
  – 1998 World Cup

8  Content Locality
• Block cache with a 4 KB block size (a simulation sketch follows)
[Figure: hit rate versus cache size for each trace; caches of 8-16 MB capture the available locality]
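The locality numbers behind this plot presumably come from replaying each trace through a simulated block cache. Below is a compilable sketch of such a measurement, assuming a simple whitespace-separated trace format and plain LRU replacement; none of this is the authors' actual tool.

    /* Replay (file id, offset, length) records through an LRU cache of
     * 4 KB blocks and report the hit rate. Linear-scan LRU: slow but
     * obviously correct, which is all a sketch needs. */
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4096u
    #define CACHE_BYTES (16u << 20)          /* 16 MB cache */

    struct block { unsigned file_id, block_no; };

    static struct block cache[CACHE_BYTES / BLOCK_SIZE];
    static size_t used;                      /* cache[0] is most recent */

    /* Touch one block; returns 1 on hit, 0 on miss. */
    static int touch(unsigned file_id, unsigned block_no)
    {
        size_t n = sizeof(cache) / sizeof(cache[0]);
        for (size_t i = 0; i < used; i++) {
            if (cache[i].file_id == file_id && cache[i].block_no == block_no) {
                struct block hit = cache[i];
                memmove(&cache[1], &cache[0], i * sizeof(cache[0]));
                cache[0] = hit;              /* move to MRU position */
                return 1;
            }
        }
        if (used < n)
            used++;                          /* grow until full, then evict LRU */
        memmove(&cache[1], &cache[0], (used - 1) * sizeof(cache[0]));
        cache[0] = (struct block){ file_id, block_no };
        return 0;
    }

    int main(void)
    {
        unsigned file_id, offset, length;
        unsigned long hits = 0, refs = 0;

        /* One request per line of stdin: "<file id> <offset> <length>" */
        while (scanf("%u %u %u", &file_id, &offset, &length) == 3) {
            if (length == 0)
                continue;
            for (unsigned b = offset / BLOCK_SIZE;
                 b <= (offset + length - 1) / BLOCK_SIZE; b++, refs++)
                hits += touch(file_id, b);
        }
        printf("hit rate: %.1f%%\n", refs ? 100.0 * hits / refs : 0.0);
        return 0;
    }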

9  Outline
• Background
• Network Interface Data Caching
• Implementation
  – OS modification / NIC API
• Experimental Prototype / Results
• Summary

10  Unmodified Operating System
• Transmit data flow (file → network stack → device driver):
  1. Identify the pages holding the file data
  2. Protocol processing: break the data into packets
  3. Inform the network interface of the packets to send

11  Modified Operating System
• The OS completely controls the network interface data cache
• Minimal changes to the OS; the transmit flow gains two steps:
  1. Identify the pages holding the file data (unmodified)
  2. Annotate the pages via the cache directory (new step)
  3. Protocol processing: break the data into packets (unmodified)
  4. Query the directory for each packet (new step)
  5. Inform the network interface (unmodified)

12  Operating System Modification
• Device driver
  – Completely controls the cache
  – Makes all allocation, use, and replacement decisions
• Cache directory (in the device driver)
  – An entry is a tuple of (file identifier, offset within file, file revision number, flags); see the sketch below
  – Sufficient to maintain cache coherence
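The tuple above maps naturally onto a small C structure. A minimal sketch, assuming hypothetical names and a flat slot-indexed table: the entry fields are the slide's tuple, but the layout, DIR_VALID, DIR_SLOTS, and dir_lookup are all illustrative, not the paper's FreeBSD driver code.

    #include <sys/types.h>   /* ino_t, off_t */
    #include <stdint.h>

    #define DIR_VALID 0x1u
    #define DIR_SLOTS 4096   /* one directory entry per NIC cache block */

    struct dir_entry {
        ino_t    file_id;    /* file identifier (e.g., inode number) */
        off_t    offset;     /* offset of the cached block within the file */
        uint32_t revision;   /* file revision number, bumped on modification */
        uint32_t flags;      /* entry state, e.g., DIR_VALID */
    };

    static struct dir_entry directory[DIR_SLOTS];

    /* The new "query" step of the transmit path (step 4 on slide 11):
     * does the NIC already hold a current copy of this block? Returns
     * the NIC cache slot on a hit, -1 on a miss. */
    static int dir_lookup(ino_t file_id, off_t offset, uint32_t revision)
    {
        for (int slot = 0; slot < DIR_SLOTS; slot++) {
            const struct dir_entry *e = &directory[slot];
            if ((e->flags & DIR_VALID) && e->file_id == file_id &&
                e->offset == offset && e->revision == revision)
                return slot;
        }
        return -1;
    }

The revision field is presumably what makes the tuple "sufficient to maintain cache coherence": if a file changes, its old revision no longer matches, so stale copies on the NIC are never used and can simply be overwritten on replacement rather than explicitly invalidated.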

13  Network Interface API
• Initialize
• Insert data into the cache
• Append data to a packet
• Append cached data to a packet (see the sketch below)
[Figure: the NIC builds packets in its TX buffer; "append" DMAs data from main memory across the interconnect, while "append cached data" copies it from the NIC's own cache]
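In C, the four operations might be declared as below. This is a sketch with assumed names, signatures, and slot-based addressing; the actual interface is implemented as Tigon 2 firmware commands issued by the driver (see slide 15), not as a C library.

    #include <stddef.h>
    #include <stdint.h>

    /* Initialize: describe the cache geometry to the NIC firmware. */
    int nic_cache_init(size_t block_size, size_t num_blocks);

    /* Insert data into the cache: DMA a block from host memory into
     * the NIC cache slot chosen by the driver (which makes all
     * allocation and replacement decisions, per slide 12). */
    int nic_cache_insert(uint32_t slot, const void *buf, size_t len);

    /* Append data to a packet: DMA payload from host memory into the
     * packet under construction in the NIC's TX buffer (the
     * unmodified path; every byte crosses the interconnect). */
    int nic_tx_append(const void *buf, size_t len);

    /* Append cached data to a packet: copy from the NIC's own cache
     * into the TX buffer; the payload never crosses the interconnect. */
    int nic_tx_append_cached(uint32_t slot, size_t offset, size_t len);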

14  Outline
• Background
• Network Interface Data Caching
• Implementation
• Experimental Prototype / Results
• Summary

15  Prototype Server
• Athlon processor, 2 GB RAM
• 64-bit, 33 MHz PCI bus (2 Gb/s)
• Two Gigabit Ethernet NICs (4 Gb/s)
  – Based on the programmable Tigon 2 controller
  – Firmware implements the new API
• FreeBSD 4.6
  – 850 lines of new code, 150 lines of kernel changes
• thttpd web server
  – High-performance, lightweight web server
  – Supports zero-copy sendfile (sketched below)
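For context, the zero-copy path the last bullet refers to is FreeBSD's sendfile(2), which transmits file pages directly from the kernel buffer cache so the server never copies file data through user space. A minimal sketch of how a server like thttpd might use it; the single-iovec header and the lack of error handling are simplifications, and this is not thttpd's actual code.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* 'client' is a connected socket, 'fd' an open file descriptor. */
    static int send_response(int client, int fd, off_t file_size,
                             void *http_hdr, size_t hdr_len)
    {
        struct iovec hdr = { http_hdr, hdr_len };
        struct sf_hdtr hdtr = { &hdr, 1, NULL, 0 };  /* headers, no trailers */
        off_t sent = 0;

        /* Send hdr_len header bytes, then file_size bytes of fd
         * starting at offset 0, all without a user-space copy. */
        return sendfile(fd, client, 0, (size_t)file_size, &hdtr, &sent, 0);
    }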

16  Results: PCI Traffic
[Figure: PCI traffic breakdown without caching. Roughly 60% of bus traffic is HTTP content and 30% is transfer overhead, so ~1260 Mb/s is the effective content limit; some traces saturate the PCI bus and peak at 1198 Mb/s of HTTP content, while others leave the bus at about 60% utilization]
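As a back-of-the-envelope check of these numbers (bus parameters from slide 15; treating the ~60% content share as exact is an assumption):

    B_raw     = 64 bits × 33 MHz ≈ 2100 Mb/s
    B_content ≈ 0.6 × B_raw      ≈ 1260 Mb/s

With headers and transfer overhead claiming the other ~40% of the bus, roughly 1260 Mb/s is the most content the saturated bus can carry, consistent with the observed 1198 Mb/s peak.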

17  Results: PCI Traffic Reduction
[Figure: PCI traffic reduction per trace, with 36-57% reductions across four of the traces. Callouts attribute smaller gains to low temporal reuse or low PCI utilization (CPU bottleneck), and larger gains to good temporal reuse]

18  Results: World Cup
• Temporal reuse: 84%; PCI utilization: 69%
• 57% traffic reduction, 7% throughput increase (794 Mb/s without caching, 849 Mb/s with caching)
• CPU bottleneck limits the gain

19  Results: Rice
• Temporal reuse: 40%; PCI utilization: 91%
• 40% traffic reduction, 17% throughput increase (1126 Mb/s without caching, 1322 Mb/s with caching)
• Breaks the PCI bottleneck

20  Results: NASA
• Temporal reuse: 71%; PCI utilization: 95%
• 54% traffic reduction, 31% throughput increase (1198 Mb/s without caching, 1571 Mb/s with caching)
• Breaks the PCI bottleneck

21  Summary
• Network interface data caching
  – Exploits web request locality
  – Independent of the network protocol
  – Independent of the interconnect architecture
  – Minimal changes to the OS
• 36-57% reduction in PCI traffic
• 7-31% increase in server performance
• Peak of 1571 Mb/s of content throughput
  – Surpasses the PCI bottleneck