Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand: Design and Implementation of MPICH-2 over InfiniBand with RDMA Support

Presentation transcript:

Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand

Design and Implementation of MPICH-2 over InfiniBand with RDMA Support
Liu, Jiang, Wyckoff, Panda, Ashton, Buntinas, Gropp, Toonen

Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand
Tipparaju, Santhanaraman, Nieplocha, Panda

Presented by Nikola Vouk
Advisor: Dr. Frank Mueller

Background: General Buffer Manipulation in Communication Protocols

InfiniBand
- 7.6 microsecond latency, 857 MB/s peak bandwidth
- Send/receive queue plus work-completion interface
- Asynchronous calls
- Remote Direct Memory Access (RDMA)
  –Sits between a shared-memory architecture and MPI
  –Not exactly NUMA, but close
- Provides a channel interface (read/write) for communication
- Each side registers memory that is then freely accessible to the other host; registration is required for security purposes
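The verbs-level shape of what this slide describes can be sketched roughly as follows. This is a minimal illustration, not code from the papers: it assumes a protection domain and an already-connected queue pair, and that the peer's buffer address and rkey were exchanged out of band.

```c
/* Minimal sketch of the verbs usage described above: register a local
 * buffer, then post a one-sided RDMA write to a peer.  Assumes pd, qp,
 * remote_addr, and rkey were set up and exchanged beforehand. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       void *buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    /* Registration pins the pages and yields the lkey the HCA needs. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: peer posts no receive */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a work completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Asynchronous: returns immediately; the completion appears on the CQ. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

Completion would be detected later by polling the completion queue with ibv_poll_cq(), which is what makes the interface asynchronous from the application's point of view.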

Common Problems
1. Link-layer/network protocol inefficiencies (unnecessary messages sent)
2. User-space to system-buffer copy overhead (copy time)
3. Synchronous sending/receiving and computing (the application has to stop in order to handle requests)

Problem 1: Message Passing Protocol
The basic InfiniBand protocol requires three matching writes.

RDMA channel interface

Put operation:
1. Copy the user buffer into the pre-registered buffer
2. RDMA-write the buffer to the receiver
3. Adjust the local head pointer
4. RDMA-write the new head pointer to the receiver
5. Return bytes written

Get operation:
1. Copy data from shared memory into the user buffer
2. Adjust the tail pointer
3. RDMA-write the new tail pointer to the sender
4. Return bytes read
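For concreteness, the put path above might look something like the sketch below. The rdma_channel type and rdma_write() helper are placeholders invented for illustration, not the MPICH-2 channel device's actual names; wrap-around handling is elided.

```c
/* Illustrative sketch only: the put side of a ring-buffer RDMA channel.
 * rdma_write() stands in for a one-sided RDMA write into the mirrored,
 * pre-registered region on the receiver; it is not a real verbs call. */
#include <stddef.h>
#include <string.h>

void rdma_write(size_t remote_off, const void *src, size_t len); /* assumed helper */

typedef struct {
    char   *ring;   /* pre-registered staging buffer, mirrored on the receiver */
    size_t  size;
    size_t  head;   /* advanced locally, pushed to the receiver */
    size_t  tail;   /* written back by the receiver as it drains data */
} rdma_channel;

/* Put: returns the number of bytes written (buffer wrap handling omitted). */
size_t channel_put(rdma_channel *ch, const void *user_buf, size_t len)
{
    size_t avail = ch->size - (ch->head - ch->tail);
    if (len > avail)
        len = avail;                         /* flow control via the head/tail gap */
    size_t off = ch->head % ch->size;

    /* 1. copy the user buffer into the pre-registered ring
     *    (this is the extra copy the later zero-copy schemes remove) */
    memcpy(ch->ring + off, user_buf, len);

    /* 2. RDMA-write the data into the receiver's mirrored ring */
    rdma_write(off, ch->ring + off, len);

    /* 3. advance the local head pointer */
    ch->head += len;

    /* 4. RDMA-write the new head pointer so the receiver knows data arrived */
    rdma_write(/* remote head-pointer slot */ 0, &ch->head, sizeof(ch->head));

    /* 5. report bytes written */
    return len;
}
```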

Solutions: Piggybacking and Pipelining
- Piggybacking: send the pointer update along with the data packets
- Pipelining: chop buffers into packet-sized pieces and send them out as the message comes in
- An improvement, but still less than 870 MB/s
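Continuing the illustrative rdma_channel sketch above, the two optimizations might combine along these lines: each chunk carries a small header with the sender's updated head pointer (piggybacking), and a large message is written out one chunk at a time as it is copied in (pipelining). The chunk size, header layout, and names are assumptions for illustration only.

```c
/* Continues the illustrative rdma_channel/rdma_write sketch above.
 * Piggybacking: the head-pointer update rides in each chunk's header,
 * removing the separate pointer-update RDMA write.
 * Pipelining: earlier chunks are on the wire while later ones are still
 * being copied into the registered ring. */
#define PKT_SIZE 8192                    /* assumed chunk size */

typedef struct {
    size_t new_head;                     /* piggybacked pointer update */
    size_t payload_len;                  /* bytes of user data that follow */
} chunk_header;

size_t channel_put_pipelined(rdma_channel *ch, const char *user_buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {                            /* flow control and wrap omitted */
        size_t n   = (len - sent < PKT_SIZE) ? len - sent : PKT_SIZE;
        size_t off = ch->head % ch->size;
        chunk_header *hdr = (chunk_header *)(ch->ring + off);

        memcpy(hdr + 1, user_buf + sent, n);        /* stage the chunk in the registered ring */
        ch->head        += sizeof(*hdr) + n;
        hdr->new_head    = ch->head;                /* pointer travels with the data */
        hdr->payload_len = n;

        rdma_write(off, hdr, sizeof(*hdr) + n);     /* one write per chunk, no extra pointer write */
        sent += n;
    }
    return sent;
}
```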

Problem 2: Internal Buffer Copying Overhead
- Problem: internal overhead where the user must copy data into system buffers (and into a registered memory slot)
- Solution: zero-copy buffers, which allow the system to read directly from the user's buffer

Zero-Copy Protocol at Different Levels of the MPICH Hierarchy
If the packet is large enough:
1. Register the user buffer
2. Notify the end host of the request
3. The end host issues an RDMA read
4. The data is read directly from the user buffer space
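A rough sender-side sketch of that large-message (rendezvous) path is shown below. The eager threshold, the helpers (eager_send, send_ctrl, wait_for_rndv_done), and the request layout are assumptions for illustration; only ibv_reg_mr and ibv_dereg_mr are real libibverbs calls.

```c
/* Sketch of the large-message zero-copy path: register the user buffer,
 * hand its (addr, rkey) to the receiver, and let the receiver pull the
 * data with an RDMA read.  Helpers below are assumed stand-ins. */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

#define EAGER_LIMIT (16 * 1024)                 /* assumed eager/rendezvous threshold */

void eager_send(const void *buf, size_t len);   /* assumed: copy through the ring */
void send_ctrl(const void *msg, size_t len);    /* assumed: small control message */
void wait_for_rndv_done(void);                  /* assumed: wait for the receiver's RDMA read */

typedef struct {
    uint64_t addr;                              /* sender's registered user buffer */
    uint32_t rkey;                              /* key the receiver needs for its RDMA read */
    uint32_t len;
} rndv_request;

int send_sketch(struct ibv_pd *pd, const void *user_buf, size_t len)
{
    if (len <= EAGER_LIMIT) {
        eager_send(user_buf, len);              /* small messages keep the copy path */
        return 0;
    }

    /* 1. register the user buffer itself: no staging copy on either side */
    struct ibv_mr *mr = ibv_reg_mr(pd, (void *)user_buf, len, IBV_ACCESS_REMOTE_READ);
    if (!mr)
        return -1;

    /* 2. notify the receiver where to pull the data from */
    rndv_request req = { (uintptr_t)user_buf, mr->rkey, (uint32_t)len };
    send_ctrl(&req, sizeof(req));

    /* 3-4. the receiver issues an RDMA read against (addr, rkey); only after
     *      it signals completion is it safe to deregister the buffer */
    wait_for_rndv_done();
    return ibv_dereg_mr(mr);
}
```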

Comparing Interfaces: CH3 Interface vs. RDMA Channel Interface
- Implemented directly on top of the CH3 interface
- More flexible due to access to the complete ADI-3 interface
- Always uses RDMA write

CH3 Implementation Performance
- A function of the raw underlying performance

- Pipelining always performed the worst
- The RDMA channel implementation was within 1% of CH3

Problem 3: Too Much Overhead, Not Enough Execution

Unanswered problems:
1. Registration overhead is still there, even in the cached version
2. Data transfer still requires significant cooperation from both sides (taking time away from computation)
3. Non-contiguous data is not addressed

Solutions:
1. Provide a custom API that allocates out of large pre-registered memory chunks (see the sketch below)
2. Overlap communication with computation as much as possible
3. Apply zero-copy techniques using scatter/gather RDMA calls
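Solution 1 could take roughly the following shape: one region registered up front and handed out piecemeal, so ibv_reg_mr never appears on the per-message path. The names and the simple bump-pointer policy are illustrative assumptions, not the API from the paper.

```c
/* Minimal sketch of a registered-memory pool: register one large region
 * once, then carve user allocations out of it so per-message registration
 * cost disappears from the critical path. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    struct ibv_mr *mr;      /* one registration covering the whole pool */
    char          *base;
    size_t         size;
    size_t         used;    /* bump pointer; a real pool would recycle frees */
} reg_pool;

int reg_pool_init(reg_pool *p, struct ibv_pd *pd, size_t size)
{
    p->base = malloc(size);
    if (!p->base)
        return -1;
    p->mr = ibv_reg_mr(pd, p->base, size,
                       IBV_ACCESS_LOCAL_WRITE |
                       IBV_ACCESS_REMOTE_READ |
                       IBV_ACCESS_REMOTE_WRITE);
    if (!p->mr) {
        free(p->base);
        return -1;
    }
    p->size = size;
    p->used = 0;
    return 0;
}

/* Hand out already-registered memory; the caller sends/receives from it directly. */
void *reg_pool_alloc(reg_pool *p, size_t n, uint32_t *lkey)
{
    n = (n + 63) & ~(size_t)63;          /* cache-line align */
    if (p->used + n > p->size)
        return NULL;                     /* a real pool would grow or recycle */
    void *ptr = p->base + p->used;
    p->used += n;
    *lkey = p->mr->lkey;
    return ptr;
}
```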

Host-Assisted Zero-Copy Protocol
- The host sends a request for a gather to the receiver
- The receiver posts a descriptor and continues working
- Can be implemented as a "helper" thread on the receiving host
- Same as the previous zero-copy idea, but extended to non-contiguous data
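One plausible shape for the receiver-side helper is sketched below: when a descriptor is posted, it turns the non-contiguous destination into a scatter list and issues a single RDMA read against the sender's registered buffer, then returns to waiting so the application thread keeps computing. The descriptor layout and its queueing are assumptions; the multi-SGE RDMA read work request is the real verbs mechanism.

```c
/* Sketch of servicing one host-assisted zero-copy descriptor on the
 * receiver: scatter the sender's contiguous registered buffer directly
 * into non-contiguous local blocks with a single RDMA read. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define MAX_SGE 16                     /* assumed, within the QP's max_send_sge */

typedef struct {                       /* assumed descriptor posted by the application */
    struct ibv_qp *qp;
    uint64_t remote_addr;              /* sender's contiguous registered buffer */
    uint32_t rkey;
    int      nblocks;
    void    *local_addr[MAX_SGE];      /* non-contiguous destination blocks */
    uint32_t local_len[MAX_SGE];
    uint32_t local_lkey[MAX_SGE];
} hzc_descriptor;

int service_descriptor(const hzc_descriptor *d)
{
    struct ibv_sge sge[MAX_SGE];
    for (int i = 0; i < d->nblocks; i++) {
        sge[i].addr   = (uintptr_t)d->local_addr[i];
        sge[i].length = d->local_len[i];
        sge[i].lkey   = d->local_lkey[i];
    }

    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;   /* pull from the sender */
    wr.sg_list             = sge;                /* scatter into d->nblocks pieces */
    wr.num_sge             = d->nblocks;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = d->remote_addr;
    wr.wr.rdma.rkey        = d->rkey;

    return ibv_post_send(d->qp, &wr, &bad);
}
```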

NAS MG
- Again, the pipelined method performs similarly to the zero-copy method

SUMMA Matrix Multiplication
- Significant benefit from host-assisted zero-copy

Conclusions
- Minimizing internal memory copying removes the primary memory performance obstacle
- InfiniBand provides DMA that offloads work from the CPU; coordinating registered memory carefully can minimize CPU involvement further
- With proper coding, existing MPI programs can achieve almost wire speed over InfiniBand
- The techniques could be implemented on other architectures (Gigabit Ethernet, Myrinet)

Thesis Implications
- Buddy MPICH is also a latency-hiding implementation of MPICH
- The separation happens at the ADI layer
- A buddy thread listens for connections and accepts work from the worker thread via send/receive queues (see the sketch below)
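Not taken from the thesis, but the buddy arrangement described above could be organized roughly as follows with POSIX threads: the worker enqueues communication work items and returns to computing, while the buddy thread blocks on the queue and drives the network.

```c
/* Rough pthread shape of a worker/buddy split: the worker hands work off
 * through a queue; the buddy thread drains it and performs communication,
 * hiding latency from the compute thread.  Types are assumptions. */
#include <pthread.h>
#include <stdbool.h>

typedef struct work_item {
    struct work_item *next;
    void (*run)(void *arg);            /* e.g. post a send, progress a receive */
    void *arg;
} work_item;

typedef struct {
    work_item      *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    bool            shutdown;
} work_queue;

/* Worker side: hand the communication off and return to computing. */
void submit(work_queue *q, work_item *w)
{
    w->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = w; else q->head = w;
    q->tail = w;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Buddy thread body: block until work arrives, then drive the network. */
void *buddy_main(void *arg)
{
    work_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (!q->head && !q->shutdown)
            pthread_cond_wait(&q->nonempty, &q->lock);
        if (!q->head && q->shutdown) {
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        work_item *w = q->head;
        q->head = w->next;
        if (!q->head) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);

        w->run(w->arg);                /* perform the queued communication step */
    }
}
```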