1/21 The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive
Olivier Glück, Jean-Luc Lamotte, Alain Greiner
Univ. Paris 6, France

2/21 Outline
1. Introduction
2. The MPC parallel computer
3. MPI-MPC1: the first implementation of MPICH on MPC
4. MPI-MPC2: user-level communications
5. Comparison of both implementations
6. A realistic application
7. Conclusion

3/21 Introduction
- Very low cost and high performance parallel computer
- PC cluster using optimized interconnection network
- A PCI network board (FastHSL) developed at LIP6:
  - High speed communication network (HSL, 1 Gbit/s)
  - RCUBE: router (8x8 crossbar, 8 HSL ports)
  - PCIDDC: PCI network controller (a specific communication protocol)
- Goal: supply efficient software layers
=> A specific high-performance implementation of MPICH

4/21 The MPC computer architecture

5/21 Our MPC parallel computer

6/21 The FastHSL PCI board
Hardware performances:
- latency: 2 µs
- maximum throughput on the link: 1 Gbit/s
- maximum useful throughput: 512 Mbits/s

7/21 The remote write primitive (RDMA)

8/21 PUT: the lowest-level software API
- Unix-based layer: FreeBSD or Linux
- Provides a basic kernel API using the PCI-DDC remote write
- Implemented as a kernel module
- Handles interrupts
- Zero-copy strategy using physical memory addresses
- Parameters of a PUT call:
  - remote node identifier,
  - local physical address,
  - remote physical address,
  - data length, …
- Performances:
  - 5 µs one-way latency
  - 494 Mbits/s
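To make the shape of this kernel API concrete, here is a minimal C sketch of what a PUT call could look like. The function name mpc_put, the callback mechanism and the types are illustrative assumptions, not the actual MPC PUT interface.

```c
/* Hypothetical sketch of the PUT remote-write API described above.
 * Names and types are illustrative assumptions, not the real MPC headers. */
#include <stdint.h>
#include <stddef.h>

typedef uint64_t phys_addr_t;   /* physical memory address */

/* Completion callback invoked once the remote write has completed
 * (signalled by interrupt in MPI-MPC1, by polling in MPI-MPC2). */
typedef void (*put_callback_t)(void *cookie);

/* Post a zero-copy remote write of 'length' bytes from a local physical
 * address to a physical address on the remote node.
 * Returns 0 on success, a negative value on error. */
int mpc_put(int remote_node,          /* remote node identifier           */
            phys_addr_t local_paddr,  /* local physical source address    */
            phys_addr_t remote_paddr, /* remote physical destination      */
            size_t length,            /* number of bytes to transfer      */
            put_callback_t on_done,   /* completion notification          */
            void *cookie);            /* user data passed to the callback */
```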

9/21 MPI-MPC1 architecture

10/21 MPICH on MPC: 2 main problems
- Virtual/physical address translation?
- Where to write data in remote physical memory?

11/21 MPICH requirements
Two kinds of messages:
- CTRL messages: control information or limited-size user data
- DATA messages: user data only
Services to supply:
- transmission of CTRL messages
- transmission of DATA messages
- network event signaling
- flow control for CTRL messages
=> Optimal maximum size of CTRL messages?
=> Match the Send/Receive semantics of MPICH to the remote-write semantics

12/21 MPI-MPC1 implementation (1)
CTRL messages:
- pre-allocated buffers, contiguous in physical memory, mapped into the virtual process memory
- an intermediate copy on both sender & receiver
- 4 types:
  - SHORT: user data encapsulated in a CTRL message
  - REQ: request for a DATA message transmission
  - RSP: reply to a request
  - CRDT: credits, used for flow control
DATA messages:
- zero-copy transfer mode
- rendezvous protocol using REQ & RSP messages
- physical memory description of the remote user buffer carried in the RSP
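As an illustration of the protocol above, the sketch below shows one possible layout of a CTRL message. The field names, the inlined SHORT payload size and the 16-segment limit of the RSP description are assumptions for illustration, not the actual MPI-MPC1 format.

```c
/* Hypothetical layout of a CTRL message, following the four types listed
 * above; field names and sizes are assumptions, not the real wire format. */
#include <stdint.h>

enum ctrl_type {
    CTRL_SHORT, /* user data encapsulated in the CTRL message             */
    CTRL_REQ,   /* sender asks to transfer a DATA message                 */
    CTRL_RSP,   /* receiver replies with where to write the data          */
    CTRL_CRDT   /* credits returned to the sender for flow control        */
};

struct ctrl_msg {
    uint32_t type;                    /* one of enum ctrl_type             */
    uint32_t src_rank;                /* MPI rank of the sender            */
    uint32_t tag;                     /* MPI tag used for matching         */
    uint32_t length;                  /* payload (SHORT) or DATA size (REQ) */
    union {
        uint8_t short_data[256];      /* SHORT: inlined user data (size assumed) */
        struct {                      /* RSP: physical description of the  */
            uint64_t paddr[16];       /* remote user buffer, one entry per */
            uint32_t size[16];        /* contiguous physical segment       */
            uint32_t nb_segments;
        } rsp;
    } u;
};
```

With such a format, the rendezvous for a DATA message follows directly from the slide: the sender emits a REQ, the receiver locks its destination buffer in memory and answers with a RSP listing its physical segments, and the sender then issues one remote write (PUT) per segment, with no intermediate copy.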

13/21 MPI-MPC1 implementation (2)

14/21 MPI-MPC1 performances
- Each call to the PUT layer = 1 system call
- Network event signaling uses hardware interrupts
Performances of MPI-MPC1:
- benchmark: MPI ping-pong
- platform: 2 MPC nodes with Pentium II 350 MHz processors
- one-way latency: 26 µs
- throughput: 419 Mbits/s
=> Avoid system calls and interrupts
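The ping-pong benchmark quoted here is the classic MPI pattern; a minimal, generic version (not the exact code used for these measurements) is sketched below.

```c
/* Minimal MPI ping-pong benchmark: rank 0 sends a message to rank 1,
 * which sends it straight back; half the round-trip time gives the
 * one-way latency, large messages give the sustained throughput.
 * Generic sketch, not the exact benchmark used for MPI-MPC. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iterations = 1000;
    const int size = (argc > 1) ? atoi(argv[1]) : 4;  /* message size in bytes */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iterations; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double one_way_us = (t1 - t0) / (2.0 * iterations) * 1e6;
        double mbits = (8.0 * size * 2.0 * iterations) / (t1 - t0) / 1e6;
        printf("size %d bytes: one-way latency %.2f us, throughput %.1f Mbits/s\n",
               size, one_way_us, mbits);
    }
    MPI_Finalize();
    free(buf);
    return 0;
}
```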

15/21 MPI-MPC1 & MPI-MPC2
=> Post remote write orders in user mode
=> Replace interrupts by a polling strategy

16/21 MPI-MPC2 implementation
- Network interface registers are accessed in user mode
- Exclusive access to shared network resources:
  - shared objects are kept in the kernel and mapped into user space at start-up time
  - atomic locks are provided to avoid competing accesses
- Efficient polling policy (a sketch follows below):
  - polling on the last modified entries of the LME/LMR lists
  - all the completed communications are acknowledged at once
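A rough sketch of such a user-level polling loop is given below. The descriptor layout, the done flag and the circular-list handling are assumptions made for illustration; they are not the real LME/LMR structures of MPI-MPC2.

```c
/* Hypothetical user-level polling loop over a list of message descriptors
 * (in the spirit of the LME/LMR lists mapped into user space).
 * All names and fields are illustrative assumptions. */
#include <stdint.h>

#define NB_ENTRIES 256

struct msg_entry {
    volatile uint32_t done;                    /* set by the NIC on completion */
    void (*on_complete)(struct msg_entry *);   /* upper-layer callback         */
};

struct msg_list {
    struct msg_entry entry[NB_ENTRIES];
    unsigned last_polled;                      /* index of the last entry checked */
};

/* Poll only the entries modified since the last call, starting from the
 * last polled index, and acknowledge every completed communication at once. */
static int poll_completions(struct msg_list *list)
{
    int completed = 0;
    unsigned i = list->last_polled;

    while (list->entry[i].done) {
        list->entry[i].done = 0;               /* consume the completion */
        list->entry[i].on_complete(&list->entry[i]);
        i = (i + 1) % NB_ENTRIES;              /* circular list          */
        completed++;
    }
    list->last_polled = i;
    return completed;                          /* communications acknowledged */
}
```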

17/21 MPI-MPC1 & MPI-MPC2 performances

MPI-MPC implementation   MPI-MPC1      MPI-MPC2
One-way latency          26 µs         15 µs
Throughput               419 Mbits/s   421 Mbits/s

18/21 MPI-MPC2 latency speed-up

19/21 The CADNA software
CADNA: Control of Accuracy and Debugging for Numerical Applications
- developed in the LIP6 laboratory
- controls and estimates round-off error propagation

20/21 MPI-MPC performances with CADNA
Application: solving a linear system using the Gauss method
- without pivoting: no communication
- with pivoting: a lot of short communications
=> MPI-MPC2 speed-up = 36%

21/21 Conclusions & perspectives
- 2 implementations of MPICH on top of a remote write primitive
- MPI-MPC1:
  - system calls during communication phases
  - interrupts for network event signaling
- MPI-MPC2:
  - user-level communications
  - signaling by polling
  - latency speed-up greater than 40% for short messages
- What about the maximum throughput?
  - Locking user buffers in memory and address translations are very expensive
  - MPI-MPC3 => avoid address translations by mapping the virtual process memory into a contiguous space of physical memory at application start-up time