Analyzing the Impact of Supporting Out-of-order Communication on In-order Performance with iWARP. P. Balaji, W. Feng, S. Bhagvat, D. K. Panda, R. Thakur and W. Gropp.

Presentation transcript:

Analyzing the Impact of Supporting Out-of-order Communication on In-order Performance with iWARP
P. Balaji, W. Feng, S. Bhagvat, D. K. Panda, R. Thakur and W. Gropp
Mathematics and Computer Science, Argonne National Laboratory
Department of Computer Science, Virginia Tech
Scalable Systems Group, Dell Inc.
Computer Science and Engineering, Ohio State University
Computer Science, University of Illinois at Urbana-Champaign

Motivation
High-end computing systems growing rapidly in scale
– 128K-processor system at LLNL (HPC CPU growth of 50%)
– 1M-processor systems as soon as next year
Network subsystem has to scale accordingly
– Fault tolerance and hot-spot avoidance important
Possible solution: multi-pathing
– Supported by many networks
  - InfiniBand uses subnet management to discover paths
  - 10-Gigabit Ethernet uses VLAN-based multi-pathing
– Disadvantage: out-of-order communication!

Out-of-order Communication
Different packets taking different paths mean that later-injected packets might arrive earlier
– Physical networks only deal with sending packets out of order
– Protocols on top of the network (in hardware or software) have to deal with reordering packets
Networks such as InfiniBand handle this by dropping out-of-order packets
– FECN, BECN and throttling on congestion
– Network buffering (with FECN/BECN) helps, but is not perfect

Overview of iWARP over Ethernet
Relatively new initiative by the IETF and the RDMA Consortium (RDMAC)
Backward compatibility with TCP/IP/Ethernet
– Sender stuffs iWARP packets within TCP/IP packets
– When sent, one TCP packet contains one iWARP packet
– What about on receive?
[Figure: host protocol stack. The application runs over Sockets (SDP, MPI, etc.) on software TCP/IP, or over Verbs, RDMAP, RDDP and MPA on offloaded TCP/IP; both paths sit on 10-Gigabit Ethernet]

Ethernet Packet Segmentation
[Figure: a stream of packets, each with a packet header, an iWARP header and a data payload, passes through an intermediate switch that segments them; a delayed packet leaves later packets arriving out of order with only partial payloads, so the receiver cannot identify the iWARP header]
– Intermediate switch segmentation: packets can be split or coalesced
– Current iWARP implementations do not handle out-of-order packets; they follow the approaches used by InfiniBand

Problem Statement
How do we design a feature-complete iWARP stack?
– Provide support for out-of-order arriving packets
– Maintain the performance of in-order communication
What are the tradeoffs in designing iWARP?
– Host-based iWARP
– Host-offloaded iWARP
– Host-assisted iWARP

Presentation Layout
– Introduction and Motivation
– Details of the iWARP Standard
– Design Choices for iWARP
– Experimental Evaluation
– Concluding Remarks and Future Work

Dealing with Out-of-order Packets in iWARP
iWARP specifies intelligent approaches to deal with out-of-order packets: out-of-order data placement with in-order data delivery
– If packets arrive out of order, they are directly placed in the appropriate location in memory
– The application is notified about the arrival of the message only when:
  - All packets of the message have arrived
  - All previous messages have arrived
It is necessary that iWARP recognize all packets! (A minimal sketch of this placement/delivery split follows below.)
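The following sketch is not the paper's implementation; the structure and function names (msg_state, place_segment, deliver_in_order) are hypothetical, and a real stack would place data by DMA rather than memcpy. It only illustrates the idea: segments land at their final offset in whatever order they arrive, while completions are reported strictly in message order.

```c
/* Minimal sketch of out-of-order placement with in-order delivery. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_MSGS 16

struct msg_state {
    uint8_t  *buffer;       /* registered destination (tagged) buffer; setup omitted */
    uint32_t  total_len;    /* total message length */
    uint32_t  bytes_placed; /* bytes placed so far, in any arrival order */
    int       delivered;    /* already reported to the application? */
};

static struct msg_state msgs[MAX_MSGS];
static uint32_t next_to_deliver = 0;   /* lowest message id not yet delivered */

/* Out-of-order placement: an arriving segment goes straight to its final
 * location in the destination buffer, regardless of arrival order. */
void place_segment(uint32_t msg_id, uint32_t offset,
                   const uint8_t *data, uint32_t len)
{
    struct msg_state *m = &msgs[msg_id % MAX_MSGS];
    memcpy(m->buffer + offset, data, len);
    m->bytes_placed += len;
}

/* In-order delivery: completions are reported strictly in message order,
 * so a finished message waits until every earlier message has finished. */
void deliver_in_order(void)
{
    for (;;) {
        struct msg_state *m = &msgs[next_to_deliver % MAX_MSGS];
        if (m->total_len == 0)          /* no receive posted for this slot */
            break;
        if (m->delivered || m->bytes_placed < m->total_len)
            break;                      /* hole in the sequence: stop here */
        printf("message %u complete\n", next_to_deliver);
        m->delivered = 1;
        next_to_deliver++;
    }
}
```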

MPA Protocol Frame
[Figure: MPA frame layout with segment length, DDP header, payload (if any), periodic markers, pad and CRC]
Deterministic approach to identify the packet header
– Can distinguish in-order packets from out-of-order packets (a sketch of the marker arithmetic follows below)
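As a rough illustration of how the markers make header identification deterministic, the sketch below computes where the next marker falls in the TCP stream and follows its back-pointer to the enclosing frame header. The 512-byte marker interval and the 16-bit back-pointer are taken from the MPA specification; everything else (function names, the assumption that stream offsets are measured from where markers begin) is a simplification, not the paper's code.

```c
#include <stdint.h>

#define MPA_MARKER_INTERVAL 512   /* one 4-byte marker every 512 bytes of stream */

/* Stream offset of the first marker at or after stream_off (offsets assumed
 * to be measured from the point in the stream where markers begin). */
uint32_t next_marker_offset(uint32_t stream_off)
{
    uint32_t rem = stream_off % MPA_MARKER_INTERVAL;
    return rem ? stream_off + (MPA_MARKER_INTERVAL - rem) : stream_off;
}

/* Given an out-of-order TCP segment [seg_off, seg_off + seg_len) and its
 * payload, locate the start of the enclosing MPA frame by reading the
 * marker's back-pointer.  Returns the frame header's stream offset, or -1
 * if the segment does not contain a complete marker. */
int64_t locate_frame_header(uint32_t seg_off, const uint8_t *payload,
                            uint32_t seg_len)
{
    uint32_t m = next_marker_offset(seg_off);
    if (m + 4 > seg_off + seg_len)
        return -1;                               /* no complete marker here */
    const uint8_t *p = payload + (m - seg_off);
    /* second 16 bits of the marker: distance back to the frame start */
    uint16_t back = (uint16_t)((p[2] << 8) | p[3]);
    return (int64_t)m - back;
}
```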

Presentation Layout
– Introduction and Motivation
– Details of the iWARP Standard
– Design Choices for iWARP
– Experimental Evaluation
– Concluding Remarks and Future Work

iWARP Components
iWARP consists of three layers:
– RDMAP: thin layer that interfaces the upper layers with iWARP
– RDDP: core of the iWARP stack
  - Component 1: deals with connection management and packet demultiplexing between connections
– MPA: glue layer for backward compatibility with TCP/IP
  - Component 2: performs CRC
  - Component 3: adds marker strips of data that point to the packet header

Component Onload vs. Offload
Connection management and packet demultiplexing
– Connection lookup and book-keeping are CPU intensive
– Can be done efficiently in hardware
Data integrity: CRC-32
– CPU intensive
– Can be done efficiently in hardware
Marker strips
– Tricky, as they need to be inserted in between the data
– A software implementation requires an extra copy
– A hardware implementation might require multiple DMAs

Task Distribution for the Different iWARP Designs
[Figure: host vs. NIC placement of RDMAP, RDDP, CRC, marker insertion and TCP/IP in the three designs: the host-based design keeps RDMAP, RDDP, CRC and marker insertion on the host above offloaded TCP/IP; the host-offloaded design moves them onto the NIC; the host-assisted design keeps marker insertion (with RDMAP) on the host while RDDP, CRC and TCP/IP run on the NIC]

Host-based and Host-offloaded Designs
Host-based iWARP: completely in software
– Pays the software overhead for all components
Host-offloaded iWARP: completely in hardware
– Good for packet demultiplexing and CRC
– Is it good for inserting marker strips?
  - Ideal: a true scatter/gather DMA engine that inserts markers on the fly; not available
  - Contiguous DMA with decoupled marker insertion: large chunks are DMAed and then moved on the NIC to insert markers, causing many NIC memory transactions
  - Scatter/gather DMA with coupled marker insertion: small chunks are DMAed non-contiguously, causing many DMA operations
(A back-of-the-envelope sketch of the two NIC-side strategies follows below.)
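The scaling argument behind the two NIC-side options can be sketched with simple arithmetic. The 512-byte marker interval is from MPA; the one-descriptor-per-chunk and one-NIC-move-per-marker accounting below is an illustrative assumption for this sketch, not measured data from the paper.

```c
#include <stdio.h>

#define MARKER_INTERVAL 512

int main(void)
{
    for (unsigned msg = 1024; msg <= 65536; msg *= 4) {
        unsigned markers = msg / MARKER_INTERVAL;

        /* Scatter/gather DMA with coupled marker insertion: assume one
         * descriptor per <=512-byte payload chunk, so the descriptor count
         * grows linearly with message size. */
        unsigned sg_descriptors = (msg + MARKER_INTERVAL - 1) / MARKER_INTERVAL;

        /* Contiguous DMA with decoupled marker insertion: one large DMA,
         * but the NIC must then shift data internally to open a 4-byte gap
         * at every marker position. */
        unsigned contig_dmas = 1;
        unsigned nic_moves   = markers;

        printf("%6u B msg: SG descriptors=%u | contiguous DMAs=%u, NIC moves=%u\n",
               msg, sg_descriptors, contig_dmas, nic_moves);
    }
    return 0;
}
```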

Hybrid Host-assisted Implementation
Performs:
– Packet demultiplexing and CRC in hardware
– Marker insertion in software (requires an extra copy; see the sketch below)
Fully utilizes both the host and the NIC
Summary:
– The host-based design suffers from software overheads for all tasks
– The host-offloaded design suffers from the overhead of multiple DMA operations
– The host-assisted design suffers from the extra memory copy to add the markers, but benefits from fewer DMAs
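To make the "extra copy" concrete, here is a minimal sketch of software marker insertion as the host-assisted (and host-based) designs would perform it. The 512-byte interval comes from MPA; the marker contents and the insert_markers() helper are simplified assumptions, not the paper's code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MARKER_INTERVAL 512
#define MARKER_SIZE     4

/* Copy 'len' payload bytes into a staging buffer, inserting a 4-byte marker
 * every 512 bytes of the output stream.  The single pass over the data is
 * the extra memory copy paid by the software design.  Caller frees the
 * returned buffer; the new length is written to *out_len. */
uint8_t *insert_markers(const uint8_t *payload, size_t len,
                        uint32_t fpdu_start_off, size_t *out_len)
{
    size_t max_out = len + ((len / MARKER_INTERVAL) + 2) * MARKER_SIZE;
    uint8_t *out = malloc(max_out);
    size_t in = 0, o = 0;

    if (!out) { *out_len = 0; return NULL; }

    while (in < len) {
        /* bytes until the next marker position in the output stream */
        size_t until_marker = MARKER_INTERVAL - (o % MARKER_INTERVAL);
        size_t chunk = (len - in < until_marker) ? len - in : until_marker;

        memcpy(out + o, payload + in, chunk);     /* the extra copy */
        in += chunk;
        o  += chunk;

        if (o % MARKER_INTERVAL == 0 && in < len) {
            /* simplified back-pointer to the start of the frame */
            uint16_t back = (uint16_t)(o - fpdu_start_off);
            out[o++] = 0; out[o++] = 0;           /* reserved half of marker */
            out[o++] = (uint8_t)(back >> 8);
            out[o++] = (uint8_t)(back & 0xff);
        }
    }
    *out_len = o;
    return out;
}
```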

Presentation Layout
– Introduction and Motivation
– Details of the iWARP Standard
– Design Choices for iWARP
– Experimental Evaluation
– Concluding Remarks

Experimental Testbed
4-node cluster
– Two Intel Xeon 3.0 GHz processors with a 533 MHz FSB, 2 GB of 266 MHz DDR SDRAM and 133 MHz PCI-X slots per node
– Chelsio T110 10GE TCP Offload Engines
– 12-port Fujitsu XG800 switch
– Red Hat Linux (kernel 2.4.22smp)

iWARP Microbenchmarks
[Figures: iWARP latency and iWARP bandwidth]

Out-of-cache Communication
[Figure: iWARP bandwidth for out-of-cache communication]

Computation/Communication Overlap
[Figures: overlap results for 4 KB and 128 KB message sizes]

Iso-surface Visual Rendering Application
[Figures: results for data distribution sizes of 8 KB and 1 MB]

Presentation Layout
– Introduction and Motivation
– Details of the iWARP Standard
– Design Choices for iWARP
– Experimental Evaluation
– Concluding Remarks

Concluding Remarks
With the growing scale of high-end computing systems, the network infrastructure has to scale as well
– Issues such as fault tolerance and hot-spot avoidance play an important role
While multi-path communication can help with these problems, it introduces out-of-order communication
We presented three designs of iWARP that deal with out-of-order communication
– Each design has its pros and cons
– No single design achieved the best performance in all cases

Thank You Contacts: P. Balaji: W. Feng: S. Bhagvat: D. K. Panda: R. Thakur: W. Gropp:

Backup Slides

[Figure: state machine for the SDMA engine, with states IDLE, READY and DMA BUSY and transitions triggered by send requests, host-DMA free/busy events, marker insertion and segment completion]

[Figure: pipeline of NIC state machines: an SDMA engine (IDLE, READY, DMA BUSY), a marker-insertion engine (IDLE, READY, COPY PARTIAL SEGMENT, INSERT MARKERS), a CRC engine and a SEND engine, each advancing as segments become available and complete]

iWARP Out-of-Cache Communication Bandwidth
[Figures: cache traffic on the transmit side and on the receive side]

Impact of Marker Separation on iWARP Performance
[Figures: host-offloaded iWARP latency and NIC-offloaded iWARP bandwidth]