Sockets Direct Protocol for Hybrid Network Stacks: A Case Study with iWARP over 10G Ethernet
P. Balaji, S. Bhagvat, R. Thakur and D. K. Panda

Presentation transcript:

Sockets Direct Protocol for Hybrid Network Stacks: A Case Study with iWARP over 10G Ethernet
P. Balaji, S. Bhagvat, R. Thakur and D. K. Panda
Mathematics and Computer Science, Argonne National Laboratory
High Performance Cluster Computing, Dell Inc.
Computer Science and Engineering, Ohio State University

Hybrid Stacks for High-speed Networks
Hardware Offloaded Network Stacks
– Intelligent hardware common on popular networks (InfiniBand, Quadrics, hardware iWARP/10GE)
– Worked well to achieve high performance
– Adding more features is error prone, expensive, and complex
Multi-core architectures
– Increased computational power
Hybrid Architectures
– Network Accelerators + Multi-cores
– Higher Performance + Flexibility to add more features
– Qlogic InfiniBand, Myri-10G, hybrid iWARP/10GE
Pavan Balaji, Argonne National Laboratory (HiPC: 12/20/2008)

Sockets Direct Protocol (SDP)
Industry standard high-performance sockets for IB and iWARP
Defined for two purposes:
– Maintain compatibility for existing applications
– Deliver the performance of networks to the applications
Mapping of ‘byte-stream’ protocol to ‘message’ oriented semantics
– Zero copy (for large messages)
– Buffer copy (for small messages)
[Figure: protocol stack. Labels: Sockets Applications or Libraries; Sockets; TCP; IP; Device Driver; Sockets Direct Protocol (SDP); Advanced Features; Offloaded Protocol; High-speed Network]
SDP allows applications to utilize the network performance and capabilities with ZERO modifications
Current SDP stacks are heavily optimized for hardware offloaded protocol stacks
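The split between zero copy and buffer copy can be illustrated with a minimal sketch. It assumes a size cutoff decides the path; the cutoff value, `ZCOPY_THRESHOLD`, `choose_send_path` and `stream_to_messages` are all hypothetical names chosen for illustration, since the slides do not say how a real SDP stack makes this choice.

```python
from typing import Iterator, Tuple

# Hypothetical cutoff: the slides do not state the value a real SDP stack uses.
ZCOPY_THRESHOLD = 8 * 1024


def choose_send_path(nbytes: int, threshold: int = ZCOPY_THRESHOLD) -> str:
    """Pick the transfer path for one message based on its size."""
    # Large messages: register the user buffer and transfer it in place.
    # Small messages: copy into a pre-registered buffer to avoid registration cost.
    return "zero-copy" if nbytes >= threshold else "buffer-copy"


def stream_to_messages(stream: bytes, max_message: int) -> Iterator[Tuple[str, bytes]]:
    """Cut a byte stream into messages and tag each with its transfer path."""
    for start in range(0, len(stream), max_message):
        chunk = stream[start:start + max_message]
        yield choose_send_path(len(chunk)), chunk


if __name__ == "__main__":
    for path, chunk in stream_to_messages(b"x" * 20000, 16384):
        print(path, len(chunk))
```

The application only ever sees a byte stream; the message boundaries and the path chosen for each message stay inside the SDP layer, which is what lets existing sockets code run unmodified.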

The Problem
The problem with network stacks
– Have not kept pace with the paradigm shift from fully offloaded to hybrid stacks
Sockets Direct Protocol (SDP)
– Assumes complete offload
– Optimizations like data buffering for small messages, message-level flow control
   Beneficial on hardware-offloaded network stacks but redundant on hybrid networks
   Imposes significant overheads!
SDP on hybrid stacks: Case study with iWARP/10GE
– Understand drawbacks of current SDP implementation
– Propose enhanced SDP design to avoid redundancy
– Study the impact of the new design on applications and benchmarks

Presentation Layout
Introduction
Overview of iWARP (architecture and different designs)
SDP for Hybrid hardware-software iWARP
Experimental Evaluation
Conclusions and Future Work

iWARP Components
Relatively new initiative by IETF/RDMAC
Extensions to Ethernet:
– Richer interface (zero-copy, RDMA)
– Backward compatible with TCP/IP
Three Protocol Layers
– RDMAP: Interface layer for applications
– RDDP: Core of the iWARP stack
   Connection management, packet de-multiplexing between connections
– MPA: Glue layer to deal with backward compatibility with TCP/IP
   CRC-based data integrity
   Backward compatibility to TCP/IP using markers
[Figure: iWARP stack. Labels: Application; Sockets; SDP, MPI etc.; Verbs; RDMAP; RDDP; MPA; Software TCP/IP; Offloaded TCP/IP; 10-Gigabit Ethernet]

Need for MPA: Issues with Out-of-Order Packets
[Figure: a stream of packets, each made of Packet Header, iWARP Header and Data Payload. Intermediate switch segmentation splits one packet into two, the second carrying only a Packet Header and Partial Payload. When a packet is delayed, the receiver cannot identify the iWARP header in the out-of-order packets.]

Handling Out-of-Order Packets in iWARP
[Figure: placement of RDMAP, RDDP, CRC, Markers and TCP/IP on the HOST or the NIC for the Software, Hardware and Hybrid designs]
[Figure: packet layout. Fields: Segment Length; DDP Header; Payload (if any); Marker; Pad; CRC]
Packet structure becomes overly complicated
Performing in hardware no longer straightforward!
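How markers let a receiver find the iWARP header in an out-of-order segment can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes the layout from the MPA specification (RFC 5044) as I understand it: a 4-byte marker at every 512 bytes of TCP sequence space, holding 16 reserved bits and a 16-bit back-pointer to the start of the enclosing frame. Sequence numbers are counted from 0 at the start of the stream, wrap-around is ignored, and the function names are hypothetical.

```python
import struct
from typing import List

MARKER_INTERVAL = 512  # bytes of sequence space between markers (assumed, per RFC 5044)
MARKER_SIZE = 4        # 16 reserved bits + 16-bit back-pointer


def marker_offsets(segment_seq: int, segment_len: int) -> List[int]:
    """Offsets inside a TCP segment at which a marker starts."""
    first = (-segment_seq) % MARKER_INTERVAL
    return list(range(first, segment_len, MARKER_INTERVAL))


def pack_marker(fpdu_ptr: int) -> bytes:
    """Encode a marker: reserved field (zero) followed by the back-pointer."""
    return struct.pack("!HH", 0, fpdu_ptr)


def header_seq_from_marker(marker_seq: int, marker: bytes) -> int:
    """Sequence number at which the frame containing this marker starts."""
    _reserved, fpdu_ptr = struct.unpack("!HH", marker)
    return marker_seq - fpdu_ptr


if __name__ == "__main__":
    # A segment that starts at sequence number 500 and is 600 bytes long.
    print(marker_offsets(500, 600))
    print(header_seq_from_marker(1024, pack_marker(300)))
```

Because marker positions depend only on the sequence number, a receiver can locate a marker in any segment without having seen the segments before it, then follow the back-pointer to the header.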

Implementations of iWARP
Several implementations exist
– Hardware implementations
   Optimized for performance
   Do not offer advanced features
– Software implementations
   More feature complete (handling out-of-order communication, packet drops, etc.)
   Not optimized for performance
– Hybrid implementations [balaji07:iwarp]
   Best of both worlds
[balaji07:iwarp] “Analyzing the Impact of Supporting Out-of-order Communication on In-order Performance with iWARP”. P. Balaji, S. Bhagvat, D. K. Panda, R. Thakur and W. Gropp. SC ’07

SDP Limitations for Hybrid Network Stacks
Current SDP implementations
– Heavily optimized for hardware offloaded protocol stacks
– Do not perform well on hybrid stacks
Performance-limiting features of SDP on hybrid stacks
– Redundant buffer copy for small messages
– Protocol interface extensions for message coalescing
– Asynchronous flow control
– Portability across hybrid stacks

Redundant Buffer Copy
SDP performs intermediate buffering for small messages
– Avoids memory registration costs for small messages
iWARP performs buffering to implement markers
– Strips of data need to be inserted in between the message
Our approach to avoiding buffering redundancy:
– Integrate SDP and iWARP buffering into a single copy based on information from the iWARP stack (e.g., TCP sequence number)
   SDP copies while leaving gaps for the markers
   iWARP fills in the markers into the space left by SDP
– Loss of generality: close interaction between SDP and iWARP
– Reduces buffering; improves performance
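The integrated copy can be sketched as two cooperating steps. This is an illustration under stated assumptions, not the authors' code: marker spacing (512 bytes) and size (4 bytes) follow the MPA specification as I understand it, sequence numbers count from 0 at the start of the stream, the back-pointer is measured from the start of the staging buffer, and `sdp_copy_with_gaps` and `iwarp_fill_markers` are hypothetical names.

```python
import struct
from typing import List, Tuple

MARKER_INTERVAL = 512  # assumed marker spacing in sequence space
MARKER_SIZE = 4        # assumed marker size in bytes


def sdp_copy_with_gaps(payload: bytes, start_seq: int) -> Tuple[bytearray, List[int]]:
    """SDP side: copy the payload once, leaving a gap wherever a marker belongs."""
    buf = bytearray()
    gaps: List[int] = []
    seq, i = start_seq, 0
    while i < len(payload):
        if seq % MARKER_INTERVAL == 0:
            gaps.append(len(buf))      # remember where the marker will go
            buf += bytes(MARKER_SIZE)  # leave the gap, zero-filled for now
            seq += MARKER_SIZE
        room = MARKER_INTERVAL - (seq % MARKER_INTERVAL)
        chunk = payload[i:i + room]
        buf += chunk
        seq += len(chunk)
        i += len(chunk)
    return buf, gaps


def iwarp_fill_markers(buf: bytearray, gaps: List[int]) -> None:
    """iWARP side: write markers into the gaps without moving any payload."""
    for gap in gaps:
        # Simplification: the back-pointer is the offset from the buffer start.
        buf[gap:gap + MARKER_SIZE] = struct.pack("!HH", 0, gap)


if __name__ == "__main__":
    staged, gaps = sdp_copy_with_gaps(bytes(1000), start_seq=0)
    iwarp_fill_markers(staged, gaps)
    print(len(staged), gaps)
```

The payload is touched once, by SDP. The cost is the coupling the slide mentions: SDP has to know the current TCP sequence number and the marker layout, which are iWARP internals.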

Message Coalescing
Improves performance for small messages
– Difficult to implement for hardware offloaded stacks
– Easier in hybrid stacks as software resources can be used
Issue: protocol stacks such as iWARP have no interface to perform message coalescing
– Message sent out as soon as the application calls a send
Our solution:
– Extend iWARP interface for applications to “append” to messages
– If a message is still queued, the next message can be appended to it, so the iWARP implementation can coalesce the messages
Improves small message performance, as fewer headers are sent
– No performance loss, as the previous message was queued anyway
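The “append” extension can be sketched with a small software send queue. The class, its methods and the `max_message` limit are hypothetical and only illustrate the idea; the slides do not give the actual interface.

```python
from typing import List


class SendQueue:
    """Toy model of a software send queue that supports appending."""

    def __init__(self, max_message: int = 4096) -> None:
        self.pending: List[bytearray] = []  # messages not yet handed to the wire
        self.max_message = max_message      # hypothetical size limit per message
        self.headers_sent = 0               # one header per message on the wire

    def send(self, data: bytes) -> None:
        """Queue data as a new message."""
        self.pending.append(bytearray(data))

    def append(self, data: bytes) -> bool:
        """Add data to the last queued message if it is still queued and fits."""
        if self.pending and len(self.pending[-1]) + len(data) <= self.max_message:
            self.pending[-1] += data
            return True
        return False

    def send_or_append(self, data: bytes) -> None:
        """What an SDP layer would call: coalesce when possible, else send."""
        if not self.append(data):
            self.send(data)

    def drain(self) -> List[bytes]:
        """Simulate the stack transmitting everything that is queued."""
        out = [bytes(m) for m in self.pending]
        self.headers_sent += len(out)
        self.pending.clear()
        return out


if __name__ == "__main__":
    q = SendQueue(max_message=64)
    for part in (b"a" * 10, b"b" * 10, b"c" * 60):
        q.send_or_append(part)
    print([len(m) for m in q.drain()], q.headers_sent)
```

Coalescing only happens while the earlier message is still waiting, so it never delays data that could already have left.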

Query Mechanism for iWARP
Portability across different network stacks affected by proposed changes
– E.g., disabling buffer copy is beneficial only for hybrid stacks, and not for hardware offloaded stacks
Different hybrid stacks might provide different features
– We should not have to develop a separate SDP for each such network stack
Solution: Extend iWARP to allow applications to query functionality
– E.g., is buffer copy provided in software?
Allows SDP to query functionality and execute appropriately
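A query interface of this kind might look like the sketch below. The feature names, `StackFeatures`, `SdpPlan` and `plan_for` are hypothetical; the slides only say that SDP should be able to ask which functionality the stack provides and adapt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackFeatures:
    """What a stack reports when queried (hypothetical field names)."""
    software_marker_buffering: bool  # does the stack buffer in software for markers?
    supports_append: bool            # does the stack offer the append extension?


@dataclass(frozen=True)
class SdpPlan:
    """How SDP decides to behave on top of that stack."""
    buffer_copy: str  # "integrated" with the stack's copy, or "separate"
    coalesce: bool


def plan_for(features: StackFeatures) -> SdpPlan:
    """Choose SDP behaviour from the queried features."""
    return SdpPlan(
        # Dropping SDP's own copy only pays off if the stack copies in software.
        buffer_copy="integrated" if features.software_marker_buffering else "separate",
        coalesce=features.supports_append,
    )


if __name__ == "__main__":
    hybrid = StackFeatures(software_marker_buffering=True, supports_append=True)
    offloaded = StackFeatures(software_marker_buffering=False, supports_append=False)
    print(plan_for(hybrid))
    print(plan_for(offloaded))
```

With this shape, one SDP implementation serves every stack: a new hybrid design only needs to answer the query, not ship its own SDP.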

SDP Latency and Bandwidth
The enhanced SDP/iWARP outperforms the basic SDP/iWARP in both the latency (10%) and bandwidth (20%) benchmarks

SDP Cache to Network Traffic Ratio

Application-level Performance
The enhanced SDP/iWARP outperforms the basic SDP/iWARP by 5% for the iso-surface and virtual microscope applications

Conclusions and Future Work
Current implementations of SDP are optimized for hardware offloaded network stacks
– Performance overhead on hybrid stacks due to redundant features
We presented an extended design for SDP
– Optimizes its execution based on underlying network features (e.g., what features are offloaded/onloaded)
– Demonstrated significant performance benefits
Future Work:
– Extend support for hybrid network stacks to other programming models as well

Thank You!
Contacts: P. Balaji, S. Bhagvat, D. K. Panda, R. Thakur