Supporting iWARP Compatibility and Features for Regular Network Adapters. P. Balaji, H.-W. Jin, K. Vaidyanathan, D. K. Panda. Network Based Computing Laboratory.

Presentation transcript:

Supporting iWARP Compatibility and Features for Regular Network Adapters
P. Balaji, H.-W. Jin, K. Vaidyanathan, D. K. Panda
Network Based Computing Laboratory (NBCL), Ohio State University

Ethernet Overview
Ethernet is the most widely used network infrastructure today
Traditionally Ethernet has been notorious for performance issues
  – Near an order-of-magnitude performance gap compared to other networks
      Cost conscious architecture
      Most Ethernet adapters were regular (layer 2) adapters
      Relied on host-based TCP/IP for network and transport layer support
      Compatibility with existing infrastructure (switch buffering, MTU)
  – Used by 42.4% of the Top500 supercomputers
  – Key: Reasonable performance at low cost
      TCP/IP over Gigabit Ethernet (GigE) can nearly saturate the link for current systems
      Several local stores give out GigE cards free of cost!
10-Gigabit Ethernet (10GigE) recently introduced
  – 10-fold (theoretical) increase in performance while retaining existing features

Ethernet: Technology Trends
Broken into three levels of technologies
  – Regular Ethernet adapters [feng03:hoti, feng03:sc, balaji04:rait]
      Layer-2 adapters
      Rely on host-based TCP/IP to provide network/transport functionality
      Could achieve a high performance with optimizations
  – TCP Offload Engines (TOEs) [balaji05:hoti, balaji05:cluster]
      Layer-4 adapters
      Have the entire TCP/IP stack offloaded on to hardware
      Sockets layer retained in the host space
  – iWARP-aware adapters [jin05:hpidc, wyckoff05:rait]
      Layer-4 adapters
      Entire TCP/IP stack offloaded on to hardware
      Support more features than TCP Offload Engines
        – No sockets! Richer iWARP interface!
        – E.g., Out-of-order placement of data, RDMA semantics

Current Usage of Ethernet
[Figure labels: System Area Network or Cluster Environment (Regular Ethernet, TOE, iWARP); Distributed Cluster Environment (Regular Ethernet Cluster, TOE Cluster, iWARP Cluster); Wide Area Network.]

Problem Statement
Regular Ethernet adapters and TOEs are completely compatible
  – Network level compatibility (Ethernet + IP + TCP + application payload)
  – Interface level compatibility (both expose the sockets interface)
With the advent of iWARP, this compatibility is disturbed
  – Both ends of a connection need to be iWARP compliant
      Intermediate nodes do not need to understand iWARP
  – The interface exposed is no longer sockets
      iWARP exposes a much richer and newer API
      Zero-copy, asynchronous and one-sided communication primitives
      Not very good for existing applications
Two primary requirements for widespread acceptance of iWARP
  – Software compatibility for Regular Ethernet with iWARP-capable adapters
  – A common interface which is similar to sockets and has the features of iWARP

Presentation Overview
  – Introduction and Motivation
  – TCP Offload Engines and iWARP
  – Overview of the Proposed Software Stack
  – Performance Evaluation
  – Conclusions and Future Work

What is a TCP Offload Engine (TOE)?
[Figure: data path of the traditional TCP/IP stack next to the TOE stack, each divided into User, Kernel and Hardware. Traditional TCP/IP stack labels: Application or Library; Sockets Interface; TCP; IP; Device Driver; Network Adapter (e.g., 10GigE). TOE stack labels: Application or Library; Sockets Interface; TCP; IP; Device Driver; Network Adapter (e.g., 10GigE) with Offloaded TCP and Offloaded IP.]

iWARP Protocol Suite
[Figure, courtesy iWARP Specification: protocol layers ULP, RDMAP, RDDP, MPA, TCP, SCTP, IP. Annotations: Feature Rich Interface; In-order Delivery and Out-of-order Placement; Middle Box Fragmentation.]
More details provided in the paper or in the iWARP Specification

Proposed Software Stack
The proposed software stack is broken into two layers
  – Software iWARP implementation
      Provides wire compatibility with iWARP-compliant adapters
      Exposes the iWARP feature set to the upper layers
      Two implementations provided: User-level iWARP and Kernel-level iWARP
  – Extended Sockets Interface
      Extends the sockets interface to encompass the iWARP features
      Maps a single file descriptor to both the iWARP as well as the normal TCP connection
      Standard sockets applications can run WITHOUT any modifications
      Minor modifications to applications required to utilize the richer feature set

Software iWARP and Extended Sockets Interface
[Figure: four stacks, each with the Application on top of the Extended Sockets Interface.
  Regular Ethernet Adapters (user-level): User-level iWARP, Sockets, TCP, IP, Device Driver, Network Adapter.
  Regular Ethernet Adapters (kernel-level): Sockets, Kernel-level iWARP, TCP (Modified with MPA), IP, Device Driver, Network Adapter.
  TCP Offload Engines: Software iWARP, High Performance Sockets, Sockets, TCP, IP, Device Driver, Network Adapter with Offloaded TCP and Offloaded IP.
  iWARP compliant Adapters: High Performance Sockets, Sockets, TCP, IP, Device Driver, Network Adapter with Offloaded iWARP, Offloaded TCP and Offloaded IP.]

Designing the Software Stack
User-level iWARP implementation
  – Non-blocking Communication Operations
  – Asynchronous Communication Progress
Kernel-level iWARP implementation
  – Zero-copy data transmission and single-copy data reception
  – Handling Out-of-order segments
Extended Sockets Interface
  – Generic Design to work over any iWARP implementation

Non-Blocking and Asynchronous Communication
User-level iWARP is a multi-threaded implementation
[Figure: two timelines, each split between a Main Thread and an Async Thread. Send-side labels: Post_send(), setsockopt(), write(). Receive-side labels: Post_recv(), setsockopt(), Recv_Done().]
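
The split between a main thread and an asynchronous progress thread can be illustrated with a small sketch. This is not the authors' implementation: the class `AsyncProgressEngine`, its methods and the `transmit` callable are hypothetical names, and a Python queue stands in for whatever mechanism the user-level library uses to hand work to its helper thread.

```python
import queue
import threading


class AsyncProgressEngine:
    """Toy model: the main thread posts sends, a helper thread completes them."""

    def __init__(self, transmit):
        self._transmit = transmit        # hypothetical callable doing the blocking I/O
        self._pending = queue.Queue()    # work requests posted by the main thread
        self._done = {}                  # request id -> completion event
        self._next_id = 0
        worker = threading.Thread(target=self._progress_loop, daemon=True)
        worker.start()

    def post_send(self, payload):
        """Queue the payload and return a request id without blocking."""
        req_id = self._next_id
        self._next_id += 1
        self._done[req_id] = threading.Event()
        self._pending.put((req_id, payload))
        return req_id

    def send_done(self, req_id):
        """Non-blocking completion check."""
        return self._done[req_id].is_set()

    def wait(self, req_id, timeout=None):
        """Block until the request completes; returns False on timeout."""
        return self._done[req_id].wait(timeout)

    def _progress_loop(self):
        # Runs on the async thread: perform the blocking write, then signal.
        while True:
            req_id, payload = self._pending.get()
            self._transmit(payload)
            self._done[req_id].set()
```

A caller would post with `rid = engine.post_send(data)`, continue computing, and later call `engine.send_done(rid)` or `engine.wait(rid)`. The sketch assumes `post_send` is only ever called from one thread.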

Zero-copy Transmission in Kernel-level iWARP
Memory map user buffers to kernel buffers
Mapping needs to be in place till the reliability ACK is received
Buffers are mapped during memory registration
  – Avoids mapping overhead during data transmission
[Figure labels: User Virtual Address Space; Kernel Virtual Address Space; Physical Address Space; Memory Registration; Data Transmission.]

Handling Out-of-order Segments
Data is retained in the socket buffer even after it is placed!
This ensures that TCP/IP handles reliability and not the iWARP stack
[Figure labels: Out-of-Order Packet arrives; NIC; INTR on arrival; DMA; socket buffers; checksum; copy; Data Placed; Application NOT notified; Wait for Intermediate packets; Iwarp_wait().]

Experimental Test-bed
Cluster of Four Node P-III 700MHz Quad-nodes
1GB 266MHz SDRAM
Alteon Gigabit Ethernet Network Adapters
Packet Engine 4-port Gigabit Ethernet switch
Linux smp

Ping-Pong Latency Test

Uni-directional Stream Bandwidth Test

Software Distribution
Public Distribution of User-level and Kernel-level Implementations
  – User-level Library
  – Kernel module for 2.4 kernels
  – Kernel patch for kernel
  – Extended Sockets Interface for software iWARP
Contact Information
  – {panda,
  –

Concluding Remarks
Ethernet has been broken down into three technology levels
  – Regular Ethernet, TCP Offload Engines and iWARP-compliant adapters
  – Compatibility between these technologies is important
Regular Ethernet and TOE are completely compatible
  – Both the wire protocol and the ULP interface are the same
  – iWARP does not share such compatibility
Two primary requirements for widespread acceptance of iWARP
  – Software compatibility for Regular Ethernet with iWARP-capable adapters
  – A common interface which is similar to sockets and has the features of iWARP
We provided a software stack which meets these requirements

Continuing and Future Work
The current Software iWARP is only built for Regular Ethernet
  – TCP Offload Engines provide more features than Regular Ethernet
  – Needs to be extended to all kinds of Ethernet networks
      E.g., TCP Offload Engines, iWARP-compliant adapters, Myrinet 10G adapters
Interoperability with Ammasso RNICs
  – Modularized approach to enable/disable components in the iWARP stack
Simulated Framework for studying NIC architectures
  – NUMA Architectures on the NIC for iWARP Offload
Flow Control/Buffer Management Features for Extended Sockets

Acknowledgments

Web Pointers Website: Group Homepage: NBCL

Backup Slides

DDP Architecture
Extended sockets API
  – Connection management handled by the standard sockets API
  – Data transfer carried out using two communication models
      Untagged Communication Model
      Tagged Communication Model
Out-of-Order Placement; In-order Delivery
Segmentation and Re-assembly of messages

DDP Untagged Communication Model
[Figure: two endpoints, each with a send queue (SQ) and a receive queue (RQ).]

DDP Untagged Model Specifications
Simple send-receive based communication model
Receiver has to inform DDP about a buffer beforehand
When data arrives, it is placed in the buffer directly
Zero-copy data transfer
No flow control guaranteed by DDP; application takes care of this
Explicit message delivery required on the receiver side
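
The untagged model can be mimicked in a few lines. The sketch below is only an illustration under simplifying assumptions: `UntaggedReceiver` and its methods are hypothetical names, whole messages arrive in one call, and a Python `bytearray` stands in for a registered buffer.

```python
from collections import deque


class UntaggedReceiver:
    """Toy receive queue: buffers are posted first, messages land in them in order."""

    def __init__(self):
        self.recv_queue = deque()    # buffers posted in advance by the application
        self.completions = deque()   # (buffer, length) pairs awaiting explicit delivery

    def post_recv(self, size):
        """Tell the receiver about a buffer before any data arrives."""
        buf = bytearray(size)
        self.recv_queue.append(buf)
        return buf

    def on_message(self, data):
        """Place an arriving message directly into the oldest posted buffer."""
        if not self.recv_queue:
            # No flow control here: avoiding this case is the application's job.
            raise RuntimeError("no receive buffer posted")
        if len(data) > len(self.recv_queue[0]):
            raise ValueError("message larger than the posted buffer")
        buf = self.recv_queue.popleft()
        buf[:len(data)] = data
        self.completions.append((buf, len(data)))

    def poll(self):
        """Explicit delivery: return the next completed (buffer, length) or None."""
        return self.completions.popleft() if self.completions else None
```

The `RuntimeError` path shows why the slide leaves flow control to the application: a sender that outruns the posted buffers has nowhere to place its data.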

DDP Tagged Communication Model
[Figure: two endpoints, each with a send queue (SQ) and a receive queue (RQ).]

DDP Tagged Model Specifications
One-sided communication model
Receiver has to inform the sender about a buffer beforehand
Sender can directly read or write to the receiver buffer
Zero-copy data transfer
No flow control required since the receiver is not involved at all
No message delivery on the receiver side; only data placement
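
A minimal sketch of the tagged model follows. It is an illustration only: `TaggedMemory`, `register`, `remote_write` and `remote_read` are hypothetical names, and an integer tag stands in for the identifier the receiver advertises to the sender.

```python
class TaggedMemory:
    """Toy model of receiver-side buffers addressed by (tag, offset)."""

    def __init__(self):
        self._regions = {}    # tag -> registered buffer
        self._next_tag = 1

    def register(self, size):
        """Register a buffer and return the tag to advertise to the sender."""
        tag = self._next_tag
        self._next_tag += 1
        self._regions[tag] = bytearray(size)
        return tag

    def remote_write(self, tag, offset, data):
        """Place sender data at (tag, offset); the receiver is not notified."""
        region = self._regions[tag]
        if offset < 0 or offset + len(data) > len(region):
            raise ValueError("write outside the advertised buffer")
        region[offset:offset + len(data)] = data

    def remote_read(self, tag, offset, length):
        """Return bytes from (tag, offset) on behalf of the remote side."""
        region = self._regions[tag]
        if offset < 0 or offset + length > len(region):
            raise ValueError("read outside the advertised buffer")
        return bytes(region[offset:offset + length])
```

Because every write names its own destination, nothing in this sketch queues or completes on the receiver side, which matches the slide's point that there is only data placement and no message delivery.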

Out-of-Order Data Placement
DDP allows out-of-order data placement
  – Two segments in a message can be transmitted out of order
  – Two segments in a message can be placed out of order
  – A message cannot be delivered till all segments in it are placed
  – A message cannot be delivered till all previous messages are delivered
Reduced buffer requirements
Most beneficial for slightly congested networks
  – TCP Fast retransmit avoids performance degradation
  – Out-of-order placement avoids extra copies and buffering
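
The placement and delivery rules above can be expressed as a short sketch. It assumes, for illustration only, that message sizes are known up front and that no segment is placed twice (reliability is left to TCP); `PlacementTracker` is a hypothetical name.

```python
class PlacementTracker:
    """Toy receiver: segments are placed in any order, messages delivered in order."""

    def __init__(self, message_sizes):
        # message_sizes[i] is the length in bytes of message i (assumed known).
        self.buffers = [bytearray(n) for n in message_sizes]
        self.remaining = list(message_sizes)   # bytes not yet placed, per message
        self.next_to_deliver = 0               # index of the oldest undelivered message

    def place(self, msg_id, offset, data):
        """Place one segment; return the ids of messages that become deliverable."""
        self.buffers[msg_id][offset:offset + len(data)] = data
        self.remaining[msg_id] -= len(data)
        delivered = []
        # Deliver only while the oldest undelivered message is fully placed.
        while (self.next_to_deliver < len(self.buffers)
               and self.remaining[self.next_to_deliver] == 0):
            delivered.append(self.next_to_deliver)
            self.next_to_deliver += 1
        return delivered
```

With two messages of 4 and 2 bytes, placing all of message 1 first delivers nothing; both messages are delivered, in order, only once the last segment of message 0 is placed.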

Segmentation and Reassembly
DDP does not deal with IP fragmentation
IP layer does IP reassembly and hands over to DDP
Segmentation is tricky in DDP
  – Message boundaries need to be retained unlike TCP streaming
  – Sender performs segmentation while maintaining boundaries
  – Receiver can perform reassembly as long as boundaries are maintained
  – What about TCP segmentation/reassembly on intermediate nodes?
  – Layer-4 switches such as Load-Balancers
      TCP aware; can assume TCP streaming semantics
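
Boundary-preserving segmentation can be sketched as follows. The tuple layout `(msg_id, offset, is_last, chunk)` is a hypothetical stand-in for a real DDP header, chosen only to show that each segment carries enough information to be reassembled regardless of arrival order.

```python
def segment(msg_id, message, max_payload):
    """Split one message into self-describing segments that keep its boundary."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    if not message:
        return [(msg_id, 0, True, b"")]
    segments = []
    for offset in range(0, len(message), max_payload):
        chunk = message[offset:offset + max_payload]
        is_last = offset + len(chunk) == len(message)
        segments.append((msg_id, offset, is_last, chunk))
    return segments


def reassemble(segments):
    """Rebuild one message from its segments, in whatever order they arrived."""
    # The segment flagged as last tells us where the message ends.
    total = next(off + len(chunk) for _, off, last, chunk in segments if last)
    out = bytearray(total)
    for _, off, _, chunk in segments:
        out[off:off + len(chunk)] = chunk
    return bytes(out)
```

If an intermediate node re-splits the byte stream, these per-segment tuples no longer line up with what the receiver sees, which is the problem the last two bullets raise.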

Layer-4 Switches
[Figure labels: Client; WAN; Load Balancer; Google Servers.]

TCP Splicing
The TCP stack can assume streaming
No one-to-one correspondence between the received segments and transmitted segments
[Figure labels: Load Balancing Application; TCP/IP Stack with TCP Splicing; Network Interface Card.]

Marker PDU Aligned Protocol
DDP segments created by sender need not be retained
  – TCP Splicing
DDP header needs to be recognized
  – If message boundaries are not retained, this is not possible!
  – Need a solution independent of message segmentation
MPA Protocol
  – Places strips of data at regular intervals
      Interval denoted by the TCP sequence number
  – Each strip points to the DDP header
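
The marker idea can be illustrated with a toy byte stream. This sketch does not follow the MPA wire format: the 4-byte marker, the interval, the rule that a value of 0 means "the header starts right after this marker", and the function names are all assumptions made for the example.

```python
import struct

MARKER_LEN = 4  # assumed marker size for this sketch


def insert_markers(segments, interval):
    """Build a stream with a marker at every multiple of `interval` bytes.

    Each marker stores how many stream bytes lie between the start of the
    current segment's header and the marker (0 if the header follows it).
    """
    if interval <= MARKER_LEN:
        raise ValueError("interval must exceed the marker length")
    stream = bytearray()
    for seg in segments:
        seg_start = None
        for byte in seg:
            if len(stream) % interval == 0:
                back = 0 if seg_start is None else len(stream) - seg_start
                stream += struct.pack("!I", back)
            if seg_start is None:
                seg_start = len(stream)
            stream.append(byte)
    return bytes(stream)


def find_header(stream, pos, interval):
    """From any stream position, use the next marker to locate a segment header."""
    marker_at = -(-pos // interval) * interval   # next multiple of interval >= pos
    if marker_at + MARKER_LEN > len(stream):
        return None
    (back,) = struct.unpack_from("!I", stream, marker_at)
    return marker_at + MARKER_LEN if back == 0 else marker_at - back
```

Because marker positions depend only on the stream offset, a receiver can find a header even when an intermediate node has re-split the stream and the original segment boundaries are gone.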

MPA Protocol
[Figure labels: Segment Length; DDP Header; ULP Payload (if any); Pad; CRC. The DDP Header and ULP Payload (if any) labels appear twice in the figure.]