Page 1 Datacenter Fabric Workshop, August 22, 2005
Open MPI Overview and Current Status
Tim Woodall - LANL
Galen Shipman - LANL/UNM

Page 2 Overview
Point-to-Point Architecture
OpenIB
–Implementation
–Results
Future Work

Page 3 Point-to-Point Architecture
Component architecture:
–"Plug-ins" for different capabilities (e.g. different networks)
–Tunable run-time parameters
Three component frameworks:
–Point-to-point Messaging Layer (PML) implements MPI semantics
–Byte Transfer Layer (BTL) abstracts network interfaces
–Memory Pool (mpool) provides memory management/registration

Page 4 PML Framework
Single PML manages multiple BTL modules
–Maintains a set of BTLs on a per-peer basis
–Message fragmentation and scheduling
Implements MPI semantics
–Synchronous / buffered / ready / standard sends
–Persistent requests / request completion
Eager/rendezvous protocol (sketched below)
–Eager send of short messages
–Configurable threshold (short vs. long)
–Multiple long-message protocols
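
A minimal sketch of the eager/rendezvous decision and of striping long-message fragments across the BTLs known for a peer. All names, the 12 KB threshold, and the fragment size are illustrative assumptions, not Open MPI's internal symbols.

#include <stddef.h>
#include <stdio.h>

#define EAGER_LIMIT (12 * 1024)   /* assumed short/long threshold (bytes)   */
#define FRAG_SIZE   (64 * 1024)   /* assumed long-protocol fragment (bytes) */

struct btl { const char *name; };  /* stand-in for a BTL module */

/* Send short messages eagerly; otherwise fragment the message and stripe
 * the fragments round-robin over the BTLs available for this peer. */
static void pml_schedule(size_t len, struct btl *btls, int nbtls)
{
    if (len <= EAGER_LIMIT) {
        printf("eager send of %zu bytes via %s\n", len, btls[0].name);
        return;
    }
    for (size_t off = 0, i = 0; off < len; off += FRAG_SIZE, i++) {
        size_t frag = (len - off < FRAG_SIZE) ? len - off : FRAG_SIZE;
        printf("rendezvous fragment [%zu, %zu) via %s\n",
               off, off + frag, btls[i % nbtls].name);
    }
}

int main(void)
{
    struct btl btls[] = { { "openib:port1" }, { "openib:port2" } };
    pml_schedule(200 * 1024, btls, 2);
    return 0;
}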

Page 5 PML Protocols
Send/receive pipeline to/from pre-registered buffers (non-contiguous data)
MPI_Alloc_mem support (usage example below)
–Red/black tree of memory registrations
–BTL associated with the registration is used by the scheduler
–Xfer of contiguous data with 1 RDMA (after match)
"Leave pinned" run-time parameter
–Registration on first use
–MRU cache (configurable size) of registrations
–Bandwidth equivalent to pre-registered buffers (MPI_Alloc_mem)
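
The MPI_Alloc_mem path can be shown with a standard MPI program: the library hands back registered memory, so a large contiguous transfer can complete with a single RDMA after the match. Only the buffer size and tag below are arbitrary choices.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    void *buf;
    const int size = 8 * 1024 * 1024;          /* 8 MByte message */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Memory allocated here is registered by the library up front. */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);

    if (rank == 0)
        MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}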

Page 6 PML Protocols (Continued)
Dynamic memory registration/deregistration (sketched below)
–Fragment the message and build a pipeline of RDMA requests
–Overlap (de-)registration with RDMA
–Bandwidth reaches 97% of pre-registered memory at large message sizes (8 MBytes)
–Performance impacted by bus type/bandwidth
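
A sketch of the pipelined protocol just described: register the next fragment while the RDMA for the current one is in flight, then deregister behind it. ibv_reg_mr and ibv_dereg_mr are real libibverbs calls; post_rdma_write() and wait_rdma_done() are hypothetical helpers standing in for QP and completion-queue handling.

#include <infiniband/verbs.h>
#include <stddef.h>

void post_rdma_write(struct ibv_mr *mr, size_t len);  /* assumed helper */
void wait_rdma_done(void);                            /* assumed helper */

static void rdma_pipeline(struct ibv_pd *pd, char *buf, size_t len, size_t frag)
{
    struct ibv_mr *cur, *next = NULL;

    cur = ibv_reg_mr(pd, buf, frag < len ? frag : len, IBV_ACCESS_LOCAL_WRITE);
    for (size_t off = 0; off < len; off += frag) {
        size_t this_len = (len - off < frag) ? len - off : frag;
        post_rdma_write(cur, this_len);           /* RDMA for fragment i    */

        if (off + frag < len) {                   /* overlap: register i+1  */
            size_t rest = len - off - frag;
            next = ibv_reg_mr(pd, buf + off + frag,
                              rest < frag ? rest : frag,
                              IBV_ACCESS_LOCAL_WRITE);
        }
        wait_rdma_done();                         /* completion of i        */
        ibv_dereg_mr(cur);                        /* deregister behind RDMA */
        cur = next;
        next = NULL;
    }
}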

Page 7 BTL Framework
MPI agnostic
Provides a simple API to upper layers (sketched below)
–Tagged send/receive primitives
–One-sided put/get operations
Access to the data type engine for zero-copy data transfer
BTL modules natively support commodity networks:
–Current: self, shared memory, Myrinet GM/MX, InfiniBand mvapi/OpenIB, Portals, TCP
–Planned: LAPI, Quadrics Elan4
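
A simplified, hypothetical rendering of the BTL interface described above: a module exposes tagged send and one-sided put/get entry points behind function pointers, so the PML can drive any network through the same calls. The real Open MPI structures differ in detail; this is only an architectural sketch.

#include <stddef.h>
#include <stdint.h>

struct btl_endpoint;                  /* per-peer connection state       */

struct btl_descriptor {               /* fragment handed down by the PML */
    void   *addr;
    size_t  len;
};

struct btl_module {
    /* Tagged send/receive path, used for small matched messages. */
    int (*btl_send)(struct btl_module *btl, struct btl_endpoint *ep,
                    struct btl_descriptor *des, uint8_t tag);
    /* One-sided RDMA path, used for large pre-described transfers. */
    int (*btl_put)(struct btl_module *btl, struct btl_endpoint *ep,
                   struct btl_descriptor *des);
    int (*btl_get)(struct btl_module *btl, struct btl_endpoint *ep,
                   struct btl_descriptor *des);
};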

Page 8 OpenIB BTL
BTL module initialization
Resource allocation
Connection management
Small message Xfer
Large message Xfer
OpenIB issues
Future work

Page 9 BTL Module Initialization
A separate BTL module is initialized for each port on each HCA
The PML schedules across these BTL modules just as it does for any other interconnect
When multiple BTL modules exist, peers establish QP connections by matching subnets (see the sketch below)
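
An illustrative, verbs-only sketch of "one BTL module per port per HCA": walk every device, query each port, and note the active ones along with the subnet prefix used later for peer matching. This is not the actual Open MPI initialization code.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int ndev = 0;
    struct ibv_device **devs = ibv_get_device_list(&ndev);
    if (!devs)
        return 1;

    for (int d = 0; d < ndev; d++) {
        struct ibv_context *ctx = ibv_open_device(devs[d]);
        struct ibv_device_attr dattr;
        if (!ctx)
            continue;

        if (ibv_query_device(ctx, &dattr) == 0) {
            for (uint8_t p = 1; p <= dattr.phys_port_cnt; p++) {
                struct ibv_port_attr pattr;
                union ibv_gid gid;
                if (ibv_query_port(ctx, p, &pattr) || pattr.state != IBV_PORT_ACTIVE)
                    continue;
                ibv_query_gid(ctx, p, 0, &gid);
                /* A BTL module would be created here for (device, port);
                 * the subnet prefix drives peer matching later on. */
                printf("%s port %u lid 0x%x subnet 0x%llx\n",
                       ibv_get_device_name(devs[d]), p, pattr.lid,
                       (unsigned long long)gid.global.subnet_prefix);
            }
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}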

Page 10 Resource Allocation

Page 11 SRQ Scalability
Table comparing posted receive-buffer memory by number of nodes: SRQ (MBytes) vs. K*RQ per QP (MBytes), with columns for # posted and fragment size (KBytes)
K: multiplier based on the number of nodes
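
A back-of-the-envelope comparison illustrates the scaling argument behind the table: with one receive queue per QP, posted receive-buffer memory grows with the number of peers, while an SRQ keeps it roughly proportional to the K multiplier. The posted count, fragment size, and K(nodes) below are assumed values for illustration, not the figures from the slide.

#include <stdio.h>

int main(void)
{
    const double frag_kb = 32.0;   /* assumed fragment size (KBytes)    */
    const int    posted  = 64;     /* assumed receives posted per queue */

    for (int nodes = 64; nodes <= 4096; nodes *= 4) {
        double k         = 1.0 + nodes / 1024.0;      /* assumed K(nodes) */
        double per_qp_mb = nodes * posted * frag_kb / 1024.0;
        double srq_mb    = k * posted * frag_kb / 1024.0;
        printf("%5d nodes: per-QP RQ %8.1f MBytes, SRQ %6.1f MBytes\n",
               nodes, per_qp_mb, srq_mb);
    }
    return 0;
}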

Page 12 Connection Management
Addressing information is exchanged dynamically via an out-of-band (OOB) channel
–This greatly improves scalability, at the cost of increased first-message latency
–Connections are established only with peers in the same subnet (local subnet routing only; see the sketch below)
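
A minimal sketch of the subnet-matching rule, assuming the subnet prefix and LID of each port have already been exchanged over the OOB channel. The types and names here are illustrative, not Open MPI's.

#include <stdbool.h>
#include <stdint.h>

struct port_info {                 /* exchanged over the OOB channel */
    uint64_t subnet_prefix;
    uint16_t lid;
};

/* A reliable-connection QP is only set up between a local and a remote
 * port that advertise the same subnet prefix (local subnet routing). */
static bool ports_can_connect(const struct port_info *local,
                              const struct port_info *remote)
{
    return local->subnet_prefix == remote->subnet_prefix;
}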

Page 13 Small Message Xfer
–Maintain a list of pre-registered fragments for send and recv
–The list grows dynamically in chunks (more efficient to register)
–Small messages are copied to/from pre-registered buffers
–Recv descriptors are posted as needed based on min/max thresholds (sketched below)
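
A sketch of replenishing posted receive descriptors from a pool of pre-registered fragments once the posted count drops below a low-water mark. ibv_post_recv and the ibv_* structures are real libibverbs APIs; the fragment pool, the thresholds, and pop_fragment() are assumptions.

#include <infiniband/verbs.h>
#include <stdint.h>

struct frag { void *buf; uint32_t len; uint32_t lkey; };  /* pre-registered */
struct frag *pop_fragment(void);                          /* assumed pool   */

#define RECV_MIN 32    /* assumed low-water mark: caller checks this  */
#define RECV_MAX 128   /* assumed high-water mark: post back up to it */

static int replenish_recvs(struct ibv_qp *qp, int *num_posted)
{
    while (*num_posted < RECV_MAX) {
        struct frag *f = pop_fragment();
        if (!f)
            break;

        struct ibv_sge sge = {
            .addr = (uintptr_t)f->buf, .length = f->len, .lkey = f->lkey
        };
        struct ibv_recv_wr wr = {
            .wr_id = (uintptr_t)f, .sg_list = &sge, .num_sge = 1
        };
        struct ibv_recv_wr *bad;
        if (ibv_post_recv(qp, &wr, &bad))
            return -1;
        (*num_posted)++;
    }
    return 0;
}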

Page 14 Small Message Performance
Average latency:
–OpenMPI - OpenIB - *optimized: 5.13 usec
–OpenMPI - OpenIB - *defaults: 5.43 usec
–OpenMPI - Mvapi - *optimized: 5.64 usec
–OpenMPI - Mvapi - *defaults: 5.94 usec
–Mvapich - Mvapi (RDMA/mem poll): 4.19 usec
–Mvapich - Mvapi (send/recv): 6.51 usec
* Send/recv-based protocol

Page 15 Large Message Xfer
Both RDMA Write and RDMA Read are supported
RDMA Read provides better performance than RDMA Write because fewer control messages are required (see the sketch below)
RDMA pipeline protocol performance is highly dependent on I/O bus performance
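
A sketch of the RDMA Read side of the large-message path: the target's address and rkey arrive in the rendezvous control message and the receiver pulls the data, avoiding the extra control traffic an RDMA Write protocol needs. The verbs structures and ibv_post_send are real; everything around them is assumed context.

#include <infiniband/verbs.h>
#include <stdint.h>

/* Pull 'len' bytes from (remote_addr, rkey) into a locally registered
 * buffer with a single RDMA Read work request. */
static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                          void *local_buf, uint32_t len,
                          uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)local_buf,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}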

Page 16 Results OpenMPI/OpenIB - All

Page 17 Results OpenMPI/OpenIB - All - Log

Page 18 Results OpenMPI/OpenIB - Eager limit

Page 19 Results Combined Results

Page 20 Results Combined Results - Log

Page 21 OpenIB Opportunities
–User-level notification of VM activity
Caching of memory registrations can be dangerous
Need the ability to detect VM changes that affect memory registrations (such as sbrk and munmap); one user-space workaround is sketched below
–Reliable multicast for collectives
–SRQ performance: a 2/10 usec penalty, but who's counting?
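
One user-space workaround for the missing VM-change notification is to interpose on munmap() so the registration cache can be flushed before pages disappear. This is only a sketch of the general technique; invalidate_registrations() is a hypothetical cache hook, and sbrk/brk would need similar treatment.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/mman.h>

void invalidate_registrations(void *addr, size_t len);   /* assumed hook */

/* Interposed munmap(): flush any cached memory registrations that cover
 * [addr, addr+len) before the real munmap removes the mapping. */
int munmap(void *addr, size_t len)
{
    static int (*real_munmap)(void *, size_t);
    if (!real_munmap)
        real_munmap = (int (*)(void *, size_t))dlsym(RTLD_NEXT, "munmap");

    invalidate_registrations(addr, len);
    return real_munmap(addr, len);
}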

Page 22 Future Work
Small message RDMA (using a working set of peers) - optional
Dynamic connection management using Unreliable Datagrams
Dynamic connection teardown - optional

Page 23 Source Code Access
Subversion repository
Download client from:
–v1.2.1 or later
Check out with:
–svn co ompi
–Anonymous, read-only access

Page 24 Questions?
Tim Woodall
Galen Shipman

Page 25 Hardware Specs
Dual Intel Xeon 3.2 GHz
–1024 KB cache
2 GBytes memory
Bus: Intel E7525/E7520/E7320 PCI Express
Mellanox Technologies MT25208 InfiniHost III Ex
288-port Voltaire switch