Realizing the Performance Potential of the Virtual Interface Architecture Evan Speight, Hazim Abdel-Shafi, and John K. Bennett Rice University, Dept. of Electrical and Computer Engineering.

Presentation transcript:

Realizing the Performance Potential of the Virtual Interface Architecture Evan Speight, Hazim Abdel-Shafi, and John K. Bennett Rice University, Dept. of Electrical and Computer Engineering Presented by Constantin Serban, R.U.

VIA Goals
Communication infrastructure for System Area Networks (SANs).
Targets mainly high-speed cluster applications.
Efficiently harnesses the communication performance of the underlying network.

Trends
Peak bandwidth has increased by two orders of magnitude over the past decade, while user-visible latency has decreased only modestly.
The latency introduced by the protocol stack is typically several times the latency of the transport layer.
The problem becomes especially acute for small messages.

Targets
The VI architecture addresses the following issues:
Decrease latency, especially for small messages (used in synchronization).
Increase the aggregate bandwidth (only a fraction of the peak bandwidth is currently utilized).
Reduce the CPU processing spent on per-message overhead.

Overhead
Overhead mainly comes from two sources:
Every network access requires one or two traps into the kernel
–the user/kernel mode switch is time consuming
Usually two data copies occur:
–from the user buffer to the message-passing API
–from the message layer to the kernel buffer

VIA approach
Remove the kernel from the critical path
–move communication code out of the kernel into user space
Provide a zero-copy protocol
–data is sent from and received into the user buffer directly; no message copy is performed

VIA emerged as a standardization effort from Compaq, Intel, and Microsoft.
It was built on several academic ideas:
–the overall architecture is most similar to U-Net
–essential features are derived from VMMC
Among current implementations:
–GigaNet cLan – VIA implemented in hardware
–Tandem ServerNet – VIA emulated in a software driver
–Myricom Myrinet – VIA emulated in NIC firmware

VIA architecture

VIA operations
Set-Up/Tear-Down: VIA is a point-to-point, connection-oriented protocol; the VI endpoint is the core concept in VIA.
Register/De-Register Memory
Connect/Disconnect
Transmit
Receive
RDMA

VIA operations
Set-Up/Tear-Down: VIA is a point-to-point, connection-oriented protocol; the VI endpoint is the core concept in VIA.
The VipCreateVi function creates a VI endpoint from user space. The user-level library passes the call to the kernel agent, which passes the creation information to the NIC. The OS thus controls the application's access to the NIC.
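The slide mentions VipCreateVi; below is a minimal, hedged sketch of endpoint creation with the VI Provider Library (VIPL). The header name, the NIC device string, and the zeroed attribute block are assumptions, and the call signatures are paraphrased from the VIPL specification, so details may differ between implementations.

```c
#include <stdio.h>
#include <vipl.h>          /* VI Provider Library header (name assumed) */

/* Open the NIC and create one VI endpoint. The call goes through the
 * user-level library to the kernel agent, which hands the endpoint
 * information to the NIC, so the OS still mediates access to the device. */
int create_endpoint(VIP_NIC_HANDLE *nic_out, VIP_VI_HANDLE *vi_out)
{
    VIP_VI_ATTRIBUTES attrs = { 0 };   /* placeholder: reliability level, max transfer size, ... */

    if (VipOpenNic("/dev/clan0", nic_out) != VIP_SUCCESS) {  /* device name is an assumption */
        fprintf(stderr, "VipOpenNic failed\n");
        return -1;
    }
    /* No completion queues in this sketch: NULL for the send and receive CQs. */
    if (VipCreateVi(*nic_out, &attrs, NULL, NULL, vi_out) != VIP_SUCCESS) {
        fprintf(stderr, "VipCreateVi failed\n");
        return -1;
    }
    return 0;
}
```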

VIA operations - cont’d
Register/De-Register Memory:
All data buffers and descriptors reside in registered memory; the NIC performs its DMA operations directly on this memory.
Registration pins the pages in physical memory and provides a handle used to manipulate the pages and to transfer their addresses to the NIC.
It is performed once, usually at the beginning of the communication session.
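A hedged sketch of memory registration, assuming the VIPL-style call VipRegisterMem(nic, address, length, attributes, &handle); the attribute contents (protection tag, RDMA enables) are left as placeholders.

```c
#include <stdlib.h>
#include <vipl.h>

/* Allocate 'len' bytes and register them with the NIC so they can be used
 * as a data buffer or descriptor area. Returns 0 on success and outputs the
 * buffer pointer and its memory handle. */
int register_buffer(VIP_NIC_HANDLE nic, size_t len,
                    void **buf_out, VIP_MEM_HANDLE *mh_out)
{
    VIP_MEM_ATTRIBUTES mattr = { 0 };   /* placeholder: protection tag, RDMA-write enable, ... */
    void *buf = malloc(len);

    if (buf == NULL)
        return -1;
    /* Pins the pages in physical memory and hands their translations to the NIC. */
    if (VipRegisterMem(nic, buf, len, &mattr, mh_out) != VIP_SUCCESS) {
        free(buf);
        return -1;
    }
    *buf_out = buf;
    return 0;
}
```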

VIA operations - cont’d
Connect/Disconnect:
Before communication, each endpoint is connected to a remote endpoint.
The connection request is passed to the kernel agent and down to the NIC.
VIA does not define an addressing scheme; existing schemes can be used in the various implementations.
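A hedged sketch of the active side of connection setup. VipConnectRequest and its parameters are paraphrased from the VIPL specification and may differ in detail; the contents of VIP_NET_ADDRESS are implementation-defined, since VIA itself prescribes no addressing scheme.

```c
#include <vipl.h>

/* Connect a local VI endpoint to a remote one. The request travels through
 * the kernel agent down to the NIC; on success the remote endpoint's
 * attributes are returned for the local side to check. */
int connect_endpoint(VIP_VI_HANDLE vi,
                     VIP_NET_ADDRESS *local_addr,
                     VIP_NET_ADDRESS *remote_addr)
{
    VIP_VI_ATTRIBUTES remote_attrs;

    if (VipConnectRequest(vi, local_addr, remote_addr,
                          VIP_INFINITE, &remote_attrs) != VIP_SUCCESS)
        return -1;
    return 0;
}
```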

VIA operations - cont’d
Transmit/Receive:
The sender builds a descriptor for the message to be sent; the descriptor points to the actual data buffer. Both descriptor and data buffer reside in registered memory.
The application then posts a doorbell to signal the availability of the descriptor. The doorbell contains the address of the descriptor.
Doorbells are maintained in an internal queue inside the NIC.
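A hedged sketch of the transmit path. The descriptor layout varies between spec revisions, so it is hidden behind a hypothetical helper build_send_descriptor() (not part of VIPL); VipPostSend is the call that effectively rings the doorbell.

```c
#include <vipl.h>

/* Hypothetical helper (not part of VIPL): fills a VIP_DESCRIPTOR so that its
 * single data segment points at 'buf'/'len' in registered memory. */
extern void build_send_descriptor(VIP_DESCRIPTOR *d, void *buf, VIP_ULONG len,
                                  VIP_MEM_HANDLE buf_mh);

/* Post one message: 'desc' and 'buf' must both lie in registered memory,
 * identified by 'desc_mh' and 'buf_mh' respectively. */
int send_message(VIP_VI_HANDLE vi, VIP_DESCRIPTOR *desc, VIP_MEM_HANDLE desc_mh,
                 void *buf, VIP_ULONG len, VIP_MEM_HANDLE buf_mh)
{
    build_send_descriptor(desc, buf, len, buf_mh);
    /* VipPostSend enqueues the descriptor's address on the VI's send queue;
     * the NIC picks it up (the "doorbell") without a kernel trap. */
    return VipPostSend(vi, desc, desc_mh) == VIP_SUCCESS ? 0 : -1;
}
```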

VIA operations - cont’d
Transmit/Receive (cont’d):
Meanwhile, the receiver creates a descriptor that points to an empty data buffer and posts a doorbell on the receiving NIC's queue.
When the sender's doorbell reaches the head of its queue, the NIC follows the double indirection (doorbell to descriptor to buffer) and sends the data onto the network.
At the receiver, the first doorbell/descriptor is picked up from the queue and its buffer is filled with the incoming data.
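The matching receive side, again as a hedged sketch: build_recv_descriptor() is a hypothetical helper, and VIP_INFINITE as the blocking-wait timeout is an assumption.

```c
#include <vipl.h>

/* Hypothetical helper (not part of VIPL): fills a receive descriptor whose
 * single data segment points at the empty buffer 'buf'/'len'. */
extern void build_recv_descriptor(VIP_DESCRIPTOR *d, void *buf, VIP_ULONG len,
                                  VIP_MEM_HANDLE buf_mh);

/* Pre-post a receive descriptor, then block until the NIC has filled its
 * buffer with an incoming message. */
int receive_message(VIP_VI_HANDLE vi, VIP_DESCRIPTOR *desc, VIP_MEM_HANDLE desc_mh,
                    void *buf, VIP_ULONG len, VIP_MEM_HANDLE buf_mh)
{
    VIP_DESCRIPTOR *done = NULL;

    build_recv_descriptor(desc, buf, len, buf_mh);
    if (VipPostRecv(vi, desc, desc_mh) != VIP_SUCCESS)  /* must be posted before the data arrives */
        return -1;

    /* Blocking completion; the descriptor could be polled instead
     * (see the polling-vs-blocking discussion later). */
    if (VipRecvWait(vi, VIP_INFINITE, &done) != VIP_SUCCESS)
        return -1;
    return 0;
}
```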

VIA operations - cont’d
RDMA:
As a mechanism derived from VMMC, VIA allows Remote DMA operations: RDMA Read and RDMA Write.
Each node allocates a receive buffer and registers it with the NIC. Additional structures that contain read and write pointers to the receive buffers are exchanged during connection setup.
Each node can then read and write the remote node's buffer directly. These operations pose potential implementation problems.
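A hedged sketch of an RDMA Write. The remote virtual address and remote memory handle are assumed to have been exchanged at connection setup; build_rdma_write_descriptor() is a hypothetical helper, since the exact descriptor fields are not shown in the slides.

```c
#include <stdint.h>
#include <vipl.h>

/* Hypothetical helper (not part of VIPL): fills a descriptor for an RDMA
 * Write, i.e. a local data segment plus the remote virtual address and the
 * remote memory handle obtained during connection setup. */
extern void build_rdma_write_descriptor(VIP_DESCRIPTOR *d,
                                        void *local_buf, VIP_ULONG len, VIP_MEM_HANDLE local_mh,
                                        uint64_t remote_addr, VIP_MEM_HANDLE remote_mh);

/* An RDMA Write is posted on the ordinary send queue; the remote CPU is not
 * involved and no receive descriptor is consumed at the target. */
int rdma_write(VIP_VI_HANDLE vi, VIP_DESCRIPTOR *desc, VIP_MEM_HANDLE desc_mh,
               void *local_buf, VIP_ULONG len, VIP_MEM_HANDLE local_mh,
               uint64_t remote_addr, VIP_MEM_HANDLE remote_mh)
{
    build_rdma_write_descriptor(desc, local_buf, len, local_mh, remote_addr, remote_mh);
    return VipPostSend(vi, desc, desc_mh) == VIP_SUCCESS ? 0 : -1;
}
```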

Evaluation Benchmarks
Two VI implementations:
–GigaNet cLan: bandwidth 125 MB/s, latency 480 ns
–Tandem ServerNet: bandwidth 50 MB/s, latency 300 ns
Performance measured:
–Bandwidth and latency
–Polling vs. blocking
–CPU utilization

Bandwidth

Latency

Latency Polling/Blocking

CPU utilization

MPI performance using VIA
The challenge is to deliver this performance to distributed applications.
Software layers such as MPI are commonly used between VIA and the application: they provide increased usability but bring additional overhead.
How can this layer be optimized so that it uses VIA efficiently?

MPI VIA - performance

MPI observations
The difference between MPI-UDP and the MPI-VIA baseline is remarkable.
The MPI-VIA baseline is still dramatically far from native VIA performance.
Several improvements are proposed to move MPI-VIA closer to native VIA, i.e., to reduce the MPI overhead.

MPI Improvements
Eliminating unnecessary copies: MPI over UDP and the baseline MPI over VIA use a single set of receive buffers, so data must be copied to the application; instead, allow the user to register any buffer and receive into it directly.
Choosing a synchronization primitive: synchronization formerly used OS constructs/events; a better implementation uses processor swap instructions, as sketched below.
No acknowledgement: remove the per-message acknowledgement by switching to a reliable VIA delivery mode.
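A minimal sketch (not from the paper) of the swap-based synchronization idea: a user-level flag reaped with an atomic exchange instead of an OS event. C11 atomics stand in for the processor's swap instruction; the type and function names are illustrative only.

```c
#include <stdatomic.h>

/* User-level completion flag set by the communication path when a message
 * arrives; the waiter spins with an atomic exchange instead of blocking on
 * an OS event, avoiding the kernel on the critical path. */
typedef struct { atomic_int ready; } msg_flag_t;

static inline void flag_signal(msg_flag_t *f)
{
    atomic_store_explicit(&f->ready, 1, memory_order_release);
}

static inline void flag_wait(msg_flag_t *f)
{
    /* atomic_exchange compiles to a swap/xchg-style instruction; it both
     * tests and clears the flag, so the next wait starts from a clean state. */
    while (atomic_exchange_explicit(&f->ready, 0, memory_order_acquire) == 0)
        ;   /* spin in user space: no kernel trap, at the cost of CPU cycles */
}
```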

VIA - Disadvantages
Polling vs. blocking synchronization: a tradeoff between CPU consumption and completion overhead (see the sketch below).
Memory registration: locking large amounts of memory makes the virtual memory mechanisms inefficient, and registering/deregistering on the fly is slow.
Point-to-point vs. multicast: VIA lacks multicast primitives; implementing multicast over the existing point-to-point mechanism makes such communication inefficient.
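To make the polling-vs-blocking tradeoff concrete, a hedged sketch of two ways to reap a send completion; VIP_NOT_DONE and VIP_INFINITE are assumed names for the "not yet complete" return code and the infinite timeout, and the signatures are paraphrased from the VIPL spec.

```c
#include <vipl.h>

/* Polling: lowest completion latency, but the waiting CPU is fully consumed. */
VIP_DESCRIPTOR *reap_send_polling(VIP_VI_HANDLE vi)
{
    VIP_DESCRIPTOR *done = NULL;
    while (VipSendDone(vi, &done) == VIP_NOT_DONE)
        ;   /* spin: every iteration is a user-space check, no kernel involvement */
    return done;
}

/* Blocking: frees the CPU for other work, but pays for a kernel-mediated wakeup. */
VIP_DESCRIPTOR *reap_send_blocking(VIP_VI_HANDLE vi)
{
    VIP_DESCRIPTOR *done = NULL;
    if (VipSendWait(vi, VIP_INFINITE, &done) != VIP_SUCCESS)
        return NULL;
    return done;
}
```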

Conclusion
Low latency for small messages; small messages have a strong impact on application behavior.
Significant improvement over UDP communication (does this still hold after recent TCP/UDP hardware implementations?).
This comes at the expense of an uncomfortable API.