I/O Acceleration in Server Architectures

Presentation transcript:

I/O Acceleration in Server Architectures Laxmi N. Bhuyan University of California, Riverside http://www.cs.ucr.edu/~bhuyan

Acknowledgement Many slides in this presentation have been taken from, or modified from, Li Zhao's Ph.D. dissertation at UCR and Ravi Iyer's (Intel) presentation at UCR. The research has been supported by NSF, UC Micro, and Intel Research.

Enterprise Workloads: Key Characteristics
Throughput-oriented: lots of transactions, operations, etc. in flight; many VMs, processes, threads, fibers, etc.; scalability and adaptability are key.
Rich (I/O) content: TCP, SoIP, SSL, XML.
High throughput requirements: efficiency and utilization are key.

Rich I/O Content in the Enterprise
Trends: increasing layers of processing on I/O data; business-critical functions (TCP, IP storage, security, XML, etc.) that are independent of the actual application processing; exacerbated by high network rates.
High rates of I/O bandwidth with new technologies: PCI-Express, 10 Gb/s to 40 Gb/s network technologies, and it just keeps going.
(Figure: layered I/O stack, App, XML, SSL, iSCSI, TCP/IP, running over the platform and network and acting on the data.)

Network Protocols
TCP/IP protocols (4 layers) vs. the OSI Reference Model (7 layers), with example protocols:

  OSI layer         TCP/IP layer     Examples
  7 Application     4 Application    HTTP, Telnet
  6 Presentation    4 Application    XML
  5 Session         4 Application    SSL
  4 Transport       3 Transport      TCP, UDP
  3 Network         2 Internet       IP, IPSec, ICMP
  2 Data Link       1 Link           Ethernet, FDDI
  1 Physical        1 Link           Coax, signaling

Note: SSL and XML are in the Session layer of the OSI model.

Situation even worse with virtualization

Virtualization Overhead: Server Consolidation
Server consolidation: both guests run on the same physical hardware.
(Figure: server-to-server latency and bandwidth comparison at 10 Gb/s.)

Communicating with the Server: The O/S Wall
Problems: O/S overhead to move a packet between the network and the application level, caused by protocol stack (TCP/IP) processing, O/S interrupts, and data copying from kernel space to user space and vice versa. On top of that sits the PCI bottleneck between the NIC and the host.
(Figure: CPU with user/kernel split, attached to the NIC over the PCI bus.)

The Send Operation
1. The application writes the transmit data to the TCP/IP sockets interface for transmission, in payload sizes ranging from 4 KB to 64 KB.
2. The data is copied from user space to kernel space.
3. The OS segments the data into maximum transmission unit (MTU)-sized packets and adds TCP/IP header information to each packet.
4. The OS copies the data onto the network interface card (NIC) send queue.
5. The NIC performs a direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC and interrupts the CPU to indicate completion of the transfer.
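For reference, a minimal user-space sketch in C of step 1 only; the address, port, and chunk size are placeholders, and steps 2 through 5 happen inside the kernel and the NIC, below the send() call.

/* Minimal sketch of the application side of the send path (step 1).
 * Everything below send() -- the user-to-kernel copy, MTU segmentation,
 * NIC queueing, and DMA -- is done by the OS and the NIC.
 * Address and port are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int send_buffer(const char *payload, size_t len)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);          /* TCP socket */
    if (fd < 0) return -1;

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(8080);                      /* placeholder port */
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);   /* placeholder addr */

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        close(fd);
        return -1;
    }

    /* Hand the kernel 16 KB payloads; each send() incurs a copy from
     * user space into kernel socket buffers before segmentation. */
    const size_t chunk = 16 * 1024;
    size_t off = 0;
    while (off < len) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        ssize_t sent = send(fd, payload + off, n, 0);
        if (sent <= 0) break;
        off += (size_t)sent;
    }
    close(fd);
    return (off == len) ? 0 : -1;
}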

Transmit/Receive data using a standard NIC. Note: the receive path is longer than the send path because of the extra copying.
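As a companion to the send sketch above, a minimal receive loop in C; the kernel-to-user copy happens inside each recv() call. Here sock_fd is assumed to be a connected TCP socket.

/* Each recv() copies data from the kernel socket buffer (filled by the NIC
 * via DMA and the protocol stack) into the application's buffer, which is
 * the extra copy that makes the receive path longer than the send path. */
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t receive_all(int sock_fd, char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = recv(sock_fd, buf + off, len - off, 0);
        if (n == 0) break;           /* peer closed the connection */
        if (n < 0) return -1;        /* error */
        off += (size_t)n;            /* kernel-to-user copy done by recv() */
    }
    return (ssize_t)off;
}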

Where do the cycles go?

Network Bandwidth is Increasing
TCP requirements rule of thumb: 1 GHz of CPU for every 1 Gb/s of network bandwidth; by that rule, a 10 Gb/s link needs roughly 10 GHz worth of CPU cycles for protocol processing alone.
Network bandwidth outpaces Moore's Law: the gap between the rate at which network applications can be processed and the fast-growing network bandwidth keeps increasing.
(Figure: network bandwidth in Gb/s vs. CPU frequency in GHz (Moore's Law) on a log scale, 1990 through 2010.)

I/O Acceleration Techniques
TCP offload: offload TCP/IP checksum computation and segmentation to interface hardware or a programmable device (e.g., TCP offload engines, TOEs). A TOE-enabled NIC using Remote Direct Memory Access (RDMA) can use zero-copy algorithms to place data directly into application buffers.
O/S bypass: user-level software techniques to bypass the protocol stack, i.e., zero-copy protocols (these need a programmable device in the NIC for direct user-level memory access and virtual-to-physical memory mapping; e.g., VIA).
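Neither TOE nor VIA can be shown in a few lines of portable code, but the zero-copy idea itself can: a sketch using Linux sendfile(2), which removes the user-space copy by letting the kernel hand page-cache pages straight to the socket. Here sock_fd is assumed to be a connected TCP socket; this is an illustration of the copy-avoidance idea, not the offload mechanisms described above.

/* Illustrative only: a software zero-copy send using Linux sendfile(2).
 * The kernel moves data from the page cache to the socket without routing
 * it through a user-space buffer. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int send_file_zero_copy(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0) return -1;

    struct stat st;
    if (fstat(file_fd, &st) < 0) { close(file_fd); return -1; }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* No user-space copy: pages go from the page cache to the NIC. */
        ssize_t n = sendfile(sock_fd, file_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0) { close(file_fd); return -1; }
    }
    close(file_fd);
    return 0;
}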

Comparing standard TCP/IP and TOE-enabled TCP/IP stacks (source: http://www.dell.com/downloads/global/power/1q04-her.pdf).

Design of a Web Switch Using the IXP 2400 Network Processor
The same problems appear in AONs, programmable routers, and web switches: requests must climb through the network, IP, and TCP layers before any application-level processing can happen.
Our solution: bring TCP connection establishment and processing down to the network level using a network processor (NP).
(Figure: an HTTP request, "GET /cgi-bin/form HTTP/1.1, Host: www.site.com...", arriving from the Internet and traversing the IP, TCP, and application-data layers.)
Ref: L. Bhuyan, "A Network Processor Based, Content Aware Switch", IEEE Micro, May/June 2006 (with L. Zhao and Y. Luo).

Example: Design of a Web Switch Using the Intel IXP 2400 NP (IEEE Micro, May/June 2006)
(Figure: clients on the Internet send requests such as "GET /cgi-bin/form HTTP/1.1, Host: www.yahoo.com..."; the content-aware switch inspects the request and forwards it to the appropriate back end: an image server, an application server, or an HTML server.)
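The routing decision in such a switch amounts to inspecting the request URL before a back-end server is chosen. Below is a hypothetical user-space sketch of that decision logic only; the back-end names and URL rules are invented for illustration, and the actual design runs on the IXP 2400 microengines after the switch terminates the client's TCP connection.

/* Hypothetical content-aware routing decision: pick a back end from the
 * HTTP request line.  Rules and names are made up for illustration. */
#include <stdio.h>
#include <string.h>

typedef enum { BACKEND_IMAGE, BACKEND_APP, BACKEND_HTML } backend_t;

static backend_t choose_backend(const char *request_line)
{
    /* request_line looks like: "GET /cgi-bin/form HTTP/1.1" */
    if (strstr(request_line, " /cgi-bin/"))
        return BACKEND_APP;                       /* dynamic content */
    if (strstr(request_line, ".jpg ") || strstr(request_line, ".png ") ||
        strstr(request_line, ".gif "))
        return BACKEND_IMAGE;                     /* static images */
    return BACKEND_HTML;                          /* everything else */
}

int main(void)
{
    const char *req = "GET /cgi-bin/form HTTP/1.1";
    static const char *names[] = { "image server", "application server",
                                   "HTML server" };
    printf("route \"%s\" -> %s\n", req, names[choose_backend(req)]);
    return 0;
}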

But Our Concentration in This Talk: Server Acceleration
Design server (CPU) architectures to speed up protocol stack processing! The focus is on TCP/IP.

Profile of a Packet
Simulation results: FreeBSD run on the SimpleScalar simulator, with no system overheads included.
(Figure: per-packet cycle breakdown across descriptor and header accesses, IP processing, compute, TCB accesses, TCP processing, and memory copy.)
Total average clocks per packet: ~21K. Effective bandwidth: 0.6 Gb/s for a 1 KB receive.
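As a sanity check, the two headline numbers are consistent if the simulated core is assumed to run at roughly 1.5 GHz (an assumption; the slide does not state the clock rate):

\[
\mathrm{BW} \approx \frac{1\,\mathrm{KB} \times 8\ \mathrm{bits/byte}}{21{,}000\ \mathrm{cycles} \,/\, 1.5\ \mathrm{GHz}}
= \frac{8192\ \mathrm{bits}}{14\ \mu\mathrm{s}} \approx 0.59\ \mathrm{Gb/s}
\]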

Five Emerging Technologies
Optimized network protocol stack (ISSS+CODES 2003)
Cache optimization (ISSS+CODES 2003; ANCHOR 2004)
Network stack affinity scheduling
Direct Cache Access (DCA)
Lightweight threading
Memory copy engine (ICCD 2005 and IEEE TC)

Cache Optimizations

Instruction Cache Behavior
The program structure places a higher requirement on L1 I-cache size; the workload benefits more from a larger line size and a higher degree of set associativity.

Execution Time Analysis
Given a fixed total L1 cache budget on the chip, more area should be devoted to the I-cache and less to the D-cache.

Direct Cache Access (DCA)
Normal DMA write: step 1, the NIC DMA-writes the packet data toward memory; step 2, the memory controller snoops and invalidates the cache; step 3, the data is written to memory; step 4, the CPU reads the data from memory.
Direct Cache Access: step 1, the NIC DMA-writes the packet data; step 2, the cache is updated directly; step 3, the CPU reads the data from the cache.
DCA eliminates 3 to 25 memory accesses by placing packet data directly into the cache.

Memory Copy Engines
Ref: L. Bhuyan, "Hardware Support for Bulk Data Movement in Server Platforms", ICCD, October 2005 (IEEE TC, 2006), with L. Zhao et al.

Memory Overhead Simulation
(Figure: memory overhead broken down into NIC descriptors, mbufs, TCP/IP headers, and payload.)

Copy Engines
Copying is time-consuming because the CPU moves data at small granularity, the source or destination is in memory (not in the cache), and the memory accesses clog up resources.
A copy engine can perform fast copies while reducing CPU resource occupancy, carry out copies in parallel with CPU computation, and avoid cache pollution while reducing interconnect traffic.
This requires low-overhead communication between the engine and the CPU: hardware support to let the engine run asynchronously with the CPU, hardware support to share the virtual address space between the engine and the CPU, and low-overhead signaling of completion.

Asynchronous Low-Cost Copy (ALCC)
Today, memory-to-memory data copies require CPU execution. ALCC builds a copy engine and tightly couples it with the CPU: low communication overhead and asynchronous execution with respect to the CPU, so the application can continue computing during memory-to-memory copies.
(Figure: timeline in which memory copies are overlapped with application processing instead of serialized with it.)
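A hypothetical host-side sketch of how such an engine might be driven. The descriptor layout and the copy_engine_* names are invented for illustration and are not the interface from the ICCD 2005 / IEEE TC papers; a memcpy stand-in keeps the sketch self-contained.

/* Hypothetical host-side view of an asynchronous copy engine. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct copy_desc {
    uint64_t src;           /* source virtual address (shared with the CPU) */
    uint64_t dst;           /* destination virtual address                  */
    uint32_t len;           /* bytes to move                                */
    volatile uint32_t done; /* completion flag written by the engine        */
};

/* Software stand-in so the sketch compiles and runs; in hardware this would
 * be an MMIO doorbell and the copy would proceed while the CPU keeps
 * executing other work. */
static void copy_engine_submit(struct copy_desc *d)
{
    memcpy((void *)(uintptr_t)d->dst, (const void *)(uintptr_t)d->src, d->len);
    d->done = 1;
}

static inline bool copy_engine_done(const struct copy_desc *d)
{
    return d->done != 0;    /* low-overhead completion signaling */
}

void process_packet(void *payload, void *app_buf, size_t len,
                    void (*compute)(void))
{
    struct copy_desc d = {
        .src  = (uintptr_t)payload,
        .dst  = (uintptr_t)app_buf,
        .len  = (uint32_t)len,
        .done = 0,
    };

    copy_engine_submit(&d);      /* kick off the bulk copy               */
    compute();                   /* keep the CPU busy with useful work   */

    while (!copy_engine_done(&d))
        ;                        /* or block, or poll less aggressively  */
}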

Performance Evaluation

Total I/O Acceleration

Potential Efficiencies (10X): Benefits of Architectural Techniques
Ref: Greg Regnier et al., "TCP Onloading for Data Center Servers," IEEE Computer, vol. 37, Nov. 2004.
On-CPU, multi-gigabit, line-speed network I/O is possible.

CPU-NIC Integration

CPU-NIC Integration: Receive-Side Performance Comparison with Varying Numbers of Connections on a Sun Niagara 2 Machine
The integrated NIC (INIC) performs better than the discrete NIC (DNIC) when there are more than 16 connections.

Latency Comparison
The INIC achieves a lower latency, saving about 6 µs, thanks to the smaller latency of accessing I/O registers and the elimination of the PCI-E bus latency.

Current and Future Work
Architectural characteristics and optimization: TCP/IP optimization, CPU+NIC integration, TCB cache design, anatomy and optimization of driver software, caching techniques, ISA optimization, data copy engines, and simulator design. A similar analysis with virtualization and 10 GbE on multi-core CPUs is ongoing with an Intel project.
Core scheduling in multicore processors: TCP/IP scheduling on multi-core processors; application-level cache-aware and hash-based scheduling; parallel/pipeline scheduling to address throughput and latency simultaneously; scheduling to minimize power consumption; similar research with virtualization.
Design and analysis of heterogeneous multiprocessors: heterogeneous chip multiprocessors using network processors, GPUs, and FPGAs.

THANK YOU