High Performance User-Level Sockets over Gigabit Ethernet

Pavan Balaji, Ohio State University
Piyush Shivam, Ohio State University
D.K. Panda, Ohio State University
Pete Wyckoff, Ohio Supercomputer Center

Presentation Overview
- Background and Motivation
- Design Challenges
- Performance Enhancement Techniques
- Performance Results
- Conclusions

Background and Motivation
- Sockets
  - Frequently used API
  - Traditional kernel-based implementation
  - Unable to exploit high-performance networks
- Earlier solutions
  - Interrupt coalescing
  - Checksum offload
  - Insufficient; it gets worse with 10 Gigabit networks
- Can we do better?
  - User-level support

Kernel Based Implementation of Sockets

[Figure: protocol stack - Application or Library (user space) over Sockets, TCP, and IP (kernel) over the NIC (hardware)]

Pros:
- High compatibility

Cons:
- Kernel context switches
- Multiple copies
- CPU resources

Alternative Implementations of Sockets (GigaNet cLAN)

[Figure: same stack with an IP-to-VI layer added in the kernel - Application or Library (user space) over Sockets, TCP, IP, and the IP-to-VI layer (kernel) over a "VI aware" NIC (hardware)]

Pros:
- High compatibility

Cons:
- Kernel context switches
- Multiple copies
- CPU resources

Sockets over User-Level Protocols
- Sockets is a generalized protocol
- Sockets over VIA
  - Developed by Intel Corporation [shah98] and ET Research Institute [sovia01]
  - GigaNet cLAN platform
- Most networks in the world are Ethernet
  - Gigabit Ethernet: backward compatible
  - Gigabit network over the existing installation base
- MVIA: version of VIA on Gigabit Ethernet
  - Kernel based
- A need for a high-performance sockets layer over Gigabit Ethernet

User-Level Protocol over Gigabit Ethernet
- Ethernet Message Passing (EMP) Protocol
  - Zero-copy, OS-bypass, NIC-driven user-level protocol over Gigabit Ethernet
  - Developed over the dual-processor Alteon NICs
  - Complete offload of message-passing functionality to the NIC
- Piyush Shivam, Pete Wyckoff, D.K. Panda, "EMP: Zero-Copy OS-bypass NIC-driven Gigabit Ethernet Message Passing", Supercomputing, November '01
- Piyush Shivam, Pete Wyckoff, D.K. Panda, "Can User-Level Protocols take advantage of Multi-CPU NICs?", IPDPS, April '02

EMP: Latency
- A base latency of 28 µs, compared to ~120 µs for TCP, for 4-byte messages

EMP: Bandwidth
- Saturates the Gigabit Ethernet network with a peak bandwidth of 964 Mbps

Proposed Solution

[Figure: proposed stack - Application or Library, Sockets over EMP, and the EMP library in user space, an OS agent in the kernel, and the Gigabit Ethernet NIC in hardware; kernel context switches, multiple copies, and CPU resource usage are eliminated, giving high performance]

Presentation Overview
- Background and Motivation
- Design Challenges
- Performance Enhancement Techniques
- Performance Results
- Conclusions

Design Challenges
- Functionality Mismatches
- Connection Management
- Message Passing
- Resource Management
- UNIX Sockets

Functionality Mismatches and Connection Management
- Functionality mismatches
  - No API for buffer advertising in TCP (see the sketch after this list)
- Connection management
  - Data message exchange
  - Descriptors required for connection management
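
To make the buffer-advertising mismatch concrete, here is a minimal sketch (in C) of pre-posting a pool of receive descriptors when a connection is established, so incoming data always finds a buffer - something plain TCP sockets never ask the application to do. The emp_* name and the descriptor pool are hypothetical stand-ins, not the actual EMP API.

    /* Hypothetical sketch: pre-posting receive descriptors at accept time.
     * The emp_* names are stand-ins, NOT the real EMP API. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_DESC  16      /* descriptors pre-posted per connection */
    #define DESC_SIZE 8192    /* bytes covered by each descriptor */

    typedef struct {
        void *buf;            /* registered memory the NIC may place data into */
        int   posted;
    } recv_desc;

    typedef struct {
        recv_desc desc[NUM_DESC];
    } sock_conn;

    /* Stand-in for handing a descriptor to the NIC. */
    static void emp_post_recv(recv_desc *d)
    {
        d->posted = 1;        /* a real implementation would pass d->buf to the NIC */
    }

    /* On accept(), pre-post a pool of receive descriptors so that
     * arriving messages always have a landing buffer. */
    static sock_conn *hps_accept(void)
    {
        sock_conn *c = calloc(1, sizeof(*c));
        if (!c) return NULL;
        for (int i = 0; i < NUM_DESC; i++) {
            c->desc[i].buf = malloc(DESC_SIZE);
            emp_post_recv(&c->desc[i]);
        }
        return c;
    }

    int main(void)
    {
        sock_conn *c = hps_accept();
        printf("pre-posted %d descriptors of %d bytes\n", NUM_DESC, DESC_SIZE);
        (void)c;              /* cleanup omitted for brevity */
        return 0;
    }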

Message Passing
- Data streaming
  - Parts of the same message can potentially be read into different buffers (see the sketch after this list)
- Unexpected message arrivals
- Separate communication thread
  - Keeps track of used descriptors and re-posts them
  - Polling threads have a high synchronization cost
  - Sleeping threads involve OS scheduling granularity
- Rendezvous approach
- Eager with flow control
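
A minimal sketch of the data-streaming bookkeeping: one message delivered by the transport may be drained by successive read() calls into different buffers. The staged_msg staging structure is an illustrative assumption, not the paper's implementation.

    /* Sketch of data streaming over a message-based transport: one arrived
     * message may be drained by several read() calls into different buffers. */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        const char *data;   /* message as delivered by the transport */
        size_t      len;    /* total message length */
        size_t      off;    /* how much the application has consumed */
    } staged_msg;

    /* Copy out up to 'want' bytes of the staged message; repeated calls
     * continue where the previous one stopped. */
    static size_t stream_read(staged_msg *m, void *buf, size_t want)
    {
        size_t avail = m->len - m->off;
        size_t n = want < avail ? want : avail;
        memcpy(buf, m->data + m->off, n);
        m->off += n;
        return n;
    }

    int main(void)
    {
        staged_msg m = { "hello, streaming world", 22, 0 };
        char a[6] = {0}, b[32] = {0};
        stream_read(&m, a, 5);            /* first part into buffer a */
        stream_read(&m, b, sizeof b - 1); /* remainder into buffer b */
        printf("a=\"%s\" b=\"%s\"\n", a, b);
        return 0;
    }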

Rendezvous Approach

[Figure: timeline between sender and receiver, each with a send queue (SQ) and receive queue (RQ) - the sender's send() posts a Request; the receiver's receive() posts a buffer and returns an ACK; the sender then transmits the Data]
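
The same control flow as a tiny state machine: Request, then ACK once the receive buffer is posted, then Data straight into the user buffer. The message types and the in-memory "wire" are stand-ins for the real queues, used only to make the ordering explicit.

    /* Minimal sketch of the rendezvous handshake: Request -> ACK -> Data. */
    #include <stdio.h>

    typedef enum { MSG_NONE, MSG_REQUEST, MSG_ACK, MSG_DATA } msg_type;

    static msg_type wire = MSG_NONE;  /* stand-in for the network */

    static void sender_send(void)
    {
        wire = MSG_REQUEST;           /* 1. advertise intent to send */
        printf("sender: Request\n");
    }

    static void receiver_receive(void)
    {
        /* 2. user buffer is now posted */
        if (wire == MSG_REQUEST) {
            wire = MSG_ACK;           /* 3. tell sender the buffer is ready */
            printf("receiver: ACK\n");
        }
    }

    static void sender_on_ack(void)
    {
        if (wire == MSG_ACK) {
            wire = MSG_DATA;          /* 4. data lands directly in the user buffer */
            printf("sender: Data (no intermediate copy)\n");
        }
    }

    int main(void)
    {
        sender_send();
        receiver_receive();
        sender_on_ack();
        return 0;
    }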

Eager with Flow Control

[Figure: timeline between sender and receiver, each with a send queue (SQ) and receive queue (RQ) - send() pushes Data eagerly into pre-posted descriptors, an ACK returns flow-control credits, and further Data can arrive before the application posts receive()]

Resource Management and UNIX Sockets
- Resource management
  - Clean up unused descriptors (connection management)
  - Free registered memory
- UNIX sockets
  - Function overriding (see the sketch after this list)
  - Application changes
  - File descriptor tracking
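
One common way to override the socket functions for unmodified applications on Linux is an LD_PRELOAD interposer that resolves the underlying libc symbols with dlsym(RTLD_NEXT). The slides only name "Function Overriding", so the sketch below, including its file-descriptor tracking table, is an assumed illustration of the general technique rather than the paper's code.

    /* Sketch of socket-call interposition (compile as a shared library
     * and load with LD_PRELOAD). */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #define MAX_FD 1024
    static char is_fast_socket[MAX_FD];   /* file-descriptor tracking table */

    int socket(int domain, int type, int protocol)
    {
        static int (*real_socket)(int, int, int);
        if (!real_socket)
            real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

        int fd = real_socket(domain, type, protocol);
        /* Mark TCP sockets as candidates for the user-level fast path. */
        if (fd >= 0 && fd < MAX_FD && type == SOCK_STREAM)
            is_fast_socket[fd] = 1;
        return fd;
    }

    ssize_t send(int fd, const void *buf, size_t len, int flags)
    {
        static ssize_t (*real_send)(int, const void *, size_t, int);
        if (!real_send)
            real_send = (ssize_t (*)(int, const void *, size_t, int))
                            dlsym(RTLD_NEXT, "send");

        /* A real layer would divert is_fast_socket[fd] traffic to the
         * user-level transport here; this sketch falls through to libc. */
        return real_send(fd, buf, len, flags);
    }

Such a library would be built with something like gcc -shared -fPIC -o libhps.so hps.c -ldl and activated with LD_PRELOAD=./libhps.so, letting an unmodified application pick up the fast path without recompilation.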

Presentation Overview
- Background and Motivation
- Design Challenges
- Performance Enhancement Techniques
- Performance Results
- Conclusions

Performance Enhancement Techniques
- Credit Based Flow Control
- Disabling Data Streaming
- Delayed Acknowledgments
- EMP Unexpected Queue

Credit Based Flow Control

[Figure: sender/receiver timeline with send queues (SQ) and receive queues (RQ) - the sender starts with 4 credits and each send consumes one (credits left: 4, 3, 2, 1, 0); the receiver's acknowledgment restores the credits to 4, allowing multiple outstanding sends]
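
The credit mechanism in miniature: a sketch in which each send consumes one of the receiver's pre-posted credits, a zero balance blocks further sends, and an acknowledgment restores the window. The window of 4 mirrors the figure above; everything else is illustrative.

    /* Sketch of sender-side credit-based flow control. */
    #include <stdio.h>

    #define INITIAL_CREDITS 4

    static int credits = INITIAL_CREDITS;

    static int try_send(int msg_id)
    {
        if (credits == 0) {
            printf("msg %d: blocked, no credits - waiting for ACK\n", msg_id);
            return 0;                 /* caller must retry after an ACK */
        }
        credits--;                    /* one pre-posted receive descriptor used */
        printf("msg %d sent, credits left: %d\n", msg_id, credits);
        return 1;
    }

    /* The receiver's ACK returns all consumed credits in one message. */
    static void on_ack(void)
    {
        credits = INITIAL_CREDITS;
        printf("ACK received, credits restored to %d\n", credits);
    }

    int main(void)
    {
        for (int i = 1; i <= 5; i++)  /* the fifth send finds zero credits */
            try_send(i);
        on_ack();
        try_send(5);                  /* retry succeeds */
        return 0;
    }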

Non-Data Streaming and Delayed Acknowledgments
- Disabling data streaming
  - Data streaming requires an intermediate copy
  - Non-data streaming places data directly into the user buffer
- Delayed acknowledgments (see the sketch after this list)
  - Increase in bandwidth
    - Less network traffic; the NIC has less work to do
  - Decrease in latency
    - Fewer descriptors posted
    - Less tag matching at the NIC (550 ns per descriptor)
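
A sketch of the receiver side of delayed acknowledgments: credits are returned in batches rather than per message, so one ACK replaces several. The half-window threshold is an assumption made for illustration; the slides do not say when the ACK fires.

    /* Sketch of delayed (coalesced) acknowledgments on the receiver side. */
    #include <stdio.h>

    #define CREDIT_WINDOW 4
    #define ACK_THRESHOLD (CREDIT_WINDOW / 2)   /* assumed trigger point */

    static int pending_credits;   /* messages consumed but not yet acknowledged */

    static void on_message(int msg_id)
    {
        pending_credits++;
        printf("received msg %d (pending credits: %d)\n", msg_id, pending_credits);
        if (pending_credits >= ACK_THRESHOLD) {
            printf("  -> one ACK returns %d credits\n", pending_credits);
            pending_credits = 0;  /* a single ACK instead of one per message */
        }
    }

    int main(void)
    {
        for (int i = 1; i <= 4; i++)
            on_message(i);        /* 4 messages generate only 2 ACKs */
        return 0;
    }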

EMP Unexpected Queue
- EMP features an unexpected message queue
  - Advantage: checked last, so it stays out of the critical path
  - Disadvantage: requires a data copy
- Acknowledgments in the unexpected queue
  - No copy, since acknowledgments carry no data
  - Acknowledgments are pushed out of the critical path

Presentation Overview
- Background and Motivation
- Design Challenges
- Performance Enhancement Techniques
- Performance Results
- Conclusions

Performance Results
- Micro-benchmarks
  - Latency (ping-pong)
  - Bandwidth
- FTP application
- Web server
  - HTTP/1.0 specifications
  - HTTP/1.1 specifications
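
For reference, ping-pong latency is typically measured by timing many round trips of a small message and halving the average. A minimal sketch follows, using a local socketpair as a stand-in for the Gigabit Ethernet path, so the numbers it prints are not the paper's results.

    /* Minimal ping-pong latency sketch: latency = average round-trip / 2. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define ITERS    10000
    #define MSG_SIZE 4            /* 4-byte messages, as in the slides */

    int main(void)
    {
        int sv[2];
        char buf[MSG_SIZE] = "ping";
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return 1;

        if (fork() == 0) {        /* child: echo every message back */
            for (int i = 0; i < ITERS; i++) {
                read(sv[1], buf, MSG_SIZE);
                write(sv[1], buf, MSG_SIZE);
            }
            _exit(0);
        }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++) {   /* parent: ping-pong loop */
            write(sv[0], buf, MSG_SIZE);
            read(sv[0], buf, MSG_SIZE);
        }
        gettimeofday(&t1, NULL);

        double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("one-way latency: %.2f us\n", usec / ITERS / 2.0);
        return 0;
    }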

Experimental Test-bed
- Four Pentium III 700 MHz quads
- 1 GB main memory
- Alteon NICs
- Packet Engine switch
- Linux

Micro-benchmarks: Latency
- Up to 4 times improvement compared to TCP
- Overhead of 0.5 µs compared to EMP

Micro-benchmarks: Bandwidth
- An improvement of 53% compared to enhanced TCP

FTP Application
- Up to 2 times improvement compared to TCP

Web Server (HTTP/1.0)
- Up to 6 times improvement compared to TCP

Web Server (HTTP/1.1)
- Up to 3 times improvement compared to TCP

Conclusions
- Developed a high-performance user-level sockets implementation over Gigabit Ethernet
- Latency close to base EMP (28 µs)
  - 28.5 µs for non-data streaming
  - 37 µs for data streaming sockets
  - 4 times improvement in latency compared to TCP
- Peak bandwidth of 840 Mbps
  - 550 Mbps obtained by TCP with increased registered space for the kernel (up to 2 MB); the default case is 340 Mbps with 32 KB
  - Improvement of 53%

Conclusions (contd.)
- FTP application shows an improvement of nearly 2 times
- Web server shows tremendous performance improvement
  - HTTP/1.0 shows an improvement of up to 6 times
  - HTTP/1.1 shows an improvement of up to 3 times

Future Work
- Dynamic credit allocation
- NIC: the trusted component
  - Integrated QoS
  - Currently on Myrinet clusters
- Commercial applications in the data-center environment
- Extend the idea to next-generation interconnects
  - InfiniBand
  - 10 Gigabit Ethernet

Thank You

For more information, please visit the Network Based Computing Laboratory, The Ohio State University (NBC Home Page).