HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS
Moontae Lee (Nov 20, 2014), Part 1, CS6410

Overview
- Background
- User-level Networking (U-Net)
- Remote Direct Memory Access (RDMA)
- Performance

Index: Background

Network Communication
- Send: application buffer -> socket buffer -> attach headers -> data is pushed to the NIC buffer
- Receive: NIC buffer -> socket buffer -> parse headers -> data is copied into the application buffer -> application is scheduled (context switch)
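As a concrete point of reference, here is a minimal sketch in C of the traditional kernel socket path described above: each send()/recv() traps into the kernel and copies data between the application buffer and the kernel socket buffer. The peer address (127.0.0.1, echo port 7) and message size are illustrative assumptions, not anything from the slides.

```c
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* Application buffer: the data lives in user space first. */
    char msg[256] = "small request";
    char reply[256];

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(7);                 /* illustrative echo server */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);
    if (connect(fd, (struct sockaddr *)&srv, sizeof srv) < 0) {
        perror("connect");
        return 1;
    }

    /* send(): copies msg into the kernel socket buffer, headers are
     * attached, and the data is pushed to the NIC buffer. */
    send(fd, msg, strlen(msg), 0);

    /* recv(): blocks until the kernel has the reply, then copies it from
     * the socket buffer into the application buffer; the process is then
     * rescheduled (context switch) to continue. */
    ssize_t n = recv(fd, reply, sizeof reply - 1, 0);
    if (n > 0) {
        reply[n] = '\0';
        printf("got: %s\n", reply);
    }
    close(fd);
    return 0;
}
```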

Today's Theme: faster, more lightweight communication!

Terms and Problems
- Communication latency
  - Processing overhead: message-handling time at the sending/receiving ends
  - Network latency: message transmission time between the two ends (i.e., end-to-end latency)
- If the network environment offers
  - high bandwidth / low network latency, and
  - long connection durations / relatively few connections ...

TCP Offload Engine (TOE)
- ... then a TOE is the usual answer. THIS IS NOT OUR STORY!

Our Story
- Large vs. small messages
  - Large: transmission dominant -> faster networks help (e.g., video/audio streams)
  - Small: processing dominant -> a new paradigm helps (e.g., messages of just a few hundred bytes)
- Our underlying picture: sending many small messages in a LAN
  - Processing overhead is overwhelming (e.g., buffer management, message copies, interrupts)

Traditional Architecture
- Problem: messages pass through the kernel
  - Low performance: several duplicate copies; multiple abstractions between the device driver and user applications
  - Low flexibility: all protocol processing happens inside the kernel; hard to support new protocols and new message send/receive interfaces

History of High-Performance Networking
- User-level Networking (U-Net): one of the first kernel-bypassing systems
- Virtual Interface Architecture (VIA): first attempt to standardize user-level communication; combines the U-Net interface with a remote DMA service
- Remote Direct Memory Access (RDMA): modern high-performance networking; many other names, but sharing common themes

Index: U-Net

U-Net Ideas and Goals
- Move protocol processing into user space!
  - Move the entire protocol stack to user space
  - Remove the kernel completely from the data communication path
- Focusing on small messages, the key goals are high performance and high flexibility:
  - Low communication latency in a local-area setting
  - Exploit the full network bandwidth
  - Emphasis on protocol design and integration flexibility
  - Portable to off-the-shelf communication hardware

U-Net Architecture
- Traditionally: the kernel controls the network, and all communication goes via the kernel
- U-Net: applications access the network directly via a MUX; the kernel is involved only in connection setup
- Virtualized NI: provides each process the illusion of owning its own interface to the network

U-Net Building Blocks
- Endpoints: an application's (or the kernel's) handle into the network
- Communication segments: memory regions that buffer message data for sending/receiving
- Message queues (send/receive/free): hold descriptors for messages that are to be sent or have been received
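The C structs below are only a schematic sketch of how an endpoint might tie these building blocks together; the field names and fixed sizes are my own illustrative assumptions, not the actual U-Net data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* A descriptor names a message by its offset/length within the
 * communication segment, plus a tag identifying the remote endpoint. */
struct unet_descriptor {
    uint32_t offset;   /* where the message data lives in the segment */
    uint32_t length;
    uint32_t tag;      /* originating/destination endpoint address */
};

/* A simple descriptor ring shared between the application and the NI. */
struct unet_queue {
    struct unet_descriptor slots[64];
    volatile uint32_t head;   /* advanced by the producer */
    volatile uint32_t tail;   /* advanced by the consumer */
};

/* An endpoint: the application's handle into the network. */
struct unet_endpoint {
    uint8_t          *comm_segment;   /* pinned buffer for message data   */
    size_t            segment_size;
    struct unet_queue send_queue;     /* descriptors of messages to send  */
    struct unet_queue recv_queue;     /* descriptors of received messages */
    struct unet_queue free_queue;     /* free buffer space for the NI     */
};
```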

U-Net Communication: Initialization
- Create one or more endpoints for each application
- Associate a communication segment and send/receive/free message queues with each endpoint

U-Net Communication: Send
1. The application composes the data in the communication segment.
2. It pushes a descriptor for the message onto the send queue.
3. If the send queue is backed up (full), the NI leaves the descriptor in the queue and exerts back-pressure on the user process.
4. Otherwise the NI picks up the message and inserts it into the network.
5. The NI indicates the injection status with a flag; once it is set, the associated send buffer can be reused.
Sending is as simple as changing one or two pointers!
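Continuing the schematic structs sketched above (again a hypothetical illustration, not U-Net's actual code), a send really is just writing the data into the segment and advancing a queue pointer:

```c
#include <string.h>

/* Returns 0 on success, -1 if the send queue is backed up and the caller
 * should retry later (the NI's back-pressure on the user process). */
int unet_send(struct unet_endpoint *ep, uint32_t buf_offset,
              const void *data, uint32_t len, uint32_t dest_tag)
{
    struct unet_queue *q = &ep->send_queue;
    uint32_t next = (q->head + 1) % 64;
    if (next == q->tail)              /* queue full: back-pressure */
        return -1;

    /* 1. Compose the data in the communication segment. */
    memcpy(ep->comm_segment + buf_offset, data, len);

    /* 2. Push a descriptor onto the send queue; the NI will pick it up
     *    and inject the message into the network. */
    q->slots[q->head] = (struct unet_descriptor){
        .offset = buf_offset, .length = len, .tag = dest_tag };
    q->head = next;
    return 0;
}
```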

U-Net Communication: Receive
1. The NI demultiplexes incoming messages to their destination endpoints.
2. It gets available buffer space from the free queue and transfers the data into the appropriate communication segment.
3. It pushes a message descriptor onto the receive queue.
4. Receive model: either polling (the application periodically checks the status of the queue) or event-driven (an upcall signals the arrival).
5. The application reads the data using the descriptor from the receive queue.
Receiving is as simple as the NIC changing one or two pointers!
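And the polling receive model, in the same hypothetical sketch: the application checks the receive queue, reads the data out of the segment, and returns the buffer via the free queue.

```c
#include <string.h>

/* Polls the receive queue once; returns the number of bytes copied into
 * `out`, or -1 if no message has arrived yet. */
int unet_poll_recv(struct unet_endpoint *ep, void *out, uint32_t max_len)
{
    struct unet_queue *q = &ep->recv_queue;
    if (q->tail == q->head)                  /* nothing delivered yet */
        return -1;

    /* 1. Read the descriptor the NI demultiplexed to this endpoint. */
    struct unet_descriptor d = q->slots[q->tail];
    uint32_t len = d.length < max_len ? d.length : max_len;

    /* 2. Copy (or process in place) the data from the comm. segment. */
    memcpy(out, ep->comm_segment + d.offset, len);
    q->tail = (q->tail + 1) % 64;

    /* 3. Hand the buffer space back to the NI via the free queue. */
    struct unet_queue *f = &ep->free_queue;
    f->slots[f->head] = d;
    f->head = (f->head + 1) % 64;
    return (int)len;
}
```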

U-Net Protection
- Owning-process protection: endpoints, communication segments, and send/receive/free queues can be accessed only by the owning process
- Tag protection: outgoing messages are tagged with the originating endpoint address; incoming messages are delivered only to the correct destination endpoint

U-Net Zero Copy
- Base-level U-Net (might not be truly 'zero' copy)
  - Send/receive needs a buffer, which requires a copy between application data structures and the buffer in the communication segment
  - Alternatively, the application data structures can be kept in the buffer itself, avoiding the copy
- Direct-Access U-Net (true 'zero' copy)
  - Communication segments span the entire process address space
  - But requires special hardware support to check addresses

Index: RDMA

RDMA Ideas and Goals
- Move buffers between two applications across the network
- Once programs implement RDMA, it aims for the lowest latency, the highest throughput, and the smallest CPU footprint

RDMA Architecture (1/2)
- Traditionally, the socket interface involves the kernel
- RDMA instead has a dedicated verbs interface
  - The kernel is involved only on the control path
  - On the data path, applications access the RNIC directly from user space, bypassing the kernel

RDMA Architecture (2/2)
- To initiate RDMA, a data path is established from the RNIC to application memory
- The verbs interface provides the API to establish these data paths
- Once a data path is established, the application can read from and write to remote buffers directly
- The verbs interface is different from the traditional socket interface

RDMA Building Blocks
- Applications use the verbs interface to:
  - Register memory: the kernel ensures the memory is pinned and accessible by DMA
  - Create a queue pair (QP): a pair of send/receive queues
  - Create a completion queue (CQ): the RNIC puts a completion-queue element into the CQ after an operation has completed
  - Send/receive data
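On Linux these verbs map onto the libibverbs API. The sketch below shows just the setup steps named above; error handling is elided, the queue depths and buffer size are arbitrary, and the QP still has to be connected to a peer (address exchange and the INIT/RTR/RTS transitions) before it can be used, which is omitted here.

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

int main(void) {
    /* Open the first RDMA-capable device. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);
    struct ibv_pd      *pd   = ibv_alloc_pd(ctx);

    /* Register memory: the kernel pins the pages so the RNIC can DMA
     * into them, and we get local/remote keys (lkey/rkey). */
    size_t len = 1 << 20;
    void  *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* Completion queue: the RNIC posts a CQE here when work completes. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

    /* Queue pair: a send queue and a receive queue. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,        /* reliable connection */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* ... connect the QP and post work requests (not shown) ... */

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```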

RDMA Communication (1/4) - (4/4)
[Figures: Steps 1-4 of establishing the data path and transferring data]
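Once the QP is connected and the peer has shared the address and rkey of its registered buffer (by whatever out-of-band exchange the step figures illustrate), a one-sided transfer is just a work request posted to the send queue plus a completion polled from the CQ. This is a hedged sketch against libibverbs; `remote_addr` and `rkey` are assumed to have been received from the peer.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Writes `len` bytes from the registered local buffer into the remote
 * buffer, without involving the remote CPU. */
int rdma_write_sync(struct ibv_qp *qp, struct ibv_cq *cq,
                    struct ibv_mr *mr, void *local, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)local, .length = (uint32_t)len, .lkey = mr->lkey };

    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* Spin on the completion queue until the write has completed. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```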

Index: Performance

U-Net Performance: Bandwidth
[Figures: UDP and TCP bandwidth vs. message size and application data generation rate]

U-Net Performance: Latency
[Figure: end-to-end round-trip latency vs. message size]

RDMA Performance: CPU Load
[Figure: CPU load]

Modern RDMA
- Several major vendors: QLogic (InfiniBand), Mellanox, Intel, Chelsio, and others
- RDMA has evolved from the U-Net approach to have three 'modes':
  - InfiniBand (QLogic PSM API): one-sided, no 'connection setup'
  - More standard: a 'queue pair' on each side, plus a binding mechanism (one queue is for sends or receives, the other is for sensing completions)
  - One-sided RDMA: after some setup, allows one side to read or write the memory managed by the other side, but pre-permission is required
- RDMA + VLAN: needed in data centers with multitenancy

Modern RDMA
- Memory management is tricky:
  - Pages must be pinned and mapped into the IOMMU
  - The kernel zeroes pages on the first allocation request: slow
  - If a page is a mapped region of a file, the kernel may automatically issue a disk write after updates: costly
  - Integration with modern NVRAM storage is 'awkward'
  - On multicore NUMA machines, it is hard to know which core owns a particular memory page, yet this matters
- Main reason we should care? RDMA runs at 20, 40 Gb/s, and soon 100, 200, ... 1 Tb/s, but memcpy and memset run at perhaps 30 Gb/s

SoftRoCE
- A useful new option: with standard RDMA, many people worry programs won't be portable and will run only with one kind of hardware
- SoftRoCE allows use of the RDMA software stack (libibverbs) but tunnels through the ordinary TCP/IP stack, hence doesn't use hardware RDMA at all
  - Zero-copy sends, but needs one copy for receives
- Intel iWARP is aimed at something similar
  - Tries to offer RDMA with zero-copy on both sides under the TCP API by dropping TCP into the NIC
  - Requires a special NIC with an RDMA chipset

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS
Jaeyong Sung (Nov 20, 2014), Part 2, CS6410

Hadoop
- Big Data: very common in industry, e.g., Facebook, Google, Amazon, ...
- Hadoop: open-source MapReduce for handling large data; requires lots of data transfers

Hadoop Distributed File System (HDFS)
- Primary storage for Hadoop clusters: both Hadoop MapReduce and HBase rely on it
- Communication-intensive middleware, layered on top of TCP/IP

Hadoop Distributed File System (HDFS)
- Highly reliable, fault-tolerant replication
- In data-intensive applications, network performance becomes a key component
[Figure: HDFS write with replication]

Software Bottleneck
- Using TCP/IP on Linux
[Figure: TCP echo benchmark]

RDMA for HDFS
- Data structure for Hadoop: <key, value> pairs, stored in data blocks of HDFS
- Both write (replication) and read can take advantage of RDMA

FaRM: Fast Remote Memory
- Relies on cache-coherent DMA
- An object's version number is stored both in the first word of the object header and at the start of each cache line (not visible to the application, e.g., HDFS)

Traditional Lock-Free Reads
- For updating the data:
[Figure: how an update is performed]

Traditional Lock-Free Reads
- Reading requires three accesses (read the version, read the data, then re-read the version to validate)
[Figure]

FaRM Lock-Free Reads
- FaRM relies on cache-coherent DMA
- Version info is embedded in each cache line

FaRM Lock-Free Reads
- A single RDMA read suffices: the reader checks that the per-cache-line versions are consistent
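A rough illustration of the idea, not FaRM's actual code or memory layout: the object is fetched with one RDMA read, and the reader then checks that the version stamp at the start of every cache line matches the version in the header. If an update raced with the read, the versions disagree and the read is retried. The 8-byte version size and whole-cache-line object length are my assumptions.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64
#define VER_BYTES  8   /* assumed size of the version stamp */

/* Validates an object fetched with a single RDMA read. Assumed layout:
 * the object occupies a whole number of cache lines, and the first
 * 8 bytes of every cache line hold the object's version (the first
 * line doubles as the object header). */
int farm_style_validate(const uint8_t *rdma_buf, size_t obj_len,
                        uint8_t *out /* payload without version stamps */)
{
    uint64_t header_version;
    memcpy(&header_version, rdma_buf, VER_BYTES);

    size_t out_off = 0;
    for (size_t off = 0; off < obj_len; off += CACHE_LINE) {
        uint64_t line_version;
        memcpy(&line_version, rdma_buf + off, VER_BYTES);
        if (line_version != header_version)
            return -1;   /* a concurrent update raced us: retry the read */

        /* Copy the payload portion of this cache line (skip the version). */
        memcpy(out + out_off, rdma_buf + off + VER_BYTES,
               CACHE_LINE - VER_BYTES);
        out_off += CACHE_LINE - VER_BYTES;
    }
    return 0;            /* the single RDMA read gave a consistent copy */
}
```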

FaRM: Distributed Transactions
- A general mechanism to ensure consistency
- Two-stage commit (checks version numbers)

Shared Address Space
- The shared address space consists of many shared memory regions
- Consistent hashing maps a region identifier to the machine that stores the object
- Each machine is mapped to k virtual rings
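For the mapping step, a minimal consistent-hashing sketch (illustrative only; FaRM's actual placement scheme and hash function differ): each machine contributes k virtual points on a ring, and a region identifier is assigned to the machine owning the first point at or clockwise from the region's hash.

```c
#include <stdint.h>
#include <stdlib.h>

#define K_VIRTUAL 4   /* virtual ring points per machine (illustrative) */

/* A simple 64-bit mix hash (not FaRM's; any good hash works here). */
static uint64_t mix64(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33; return x;
}

struct ring_point { uint64_t hash; uint32_t machine; };

static int cmp_point(const void *a, const void *b) {
    uint64_t ha = ((const struct ring_point *)a)->hash;
    uint64_t hb = ((const struct ring_point *)b)->hash;
    return ha < hb ? -1 : ha > hb ? 1 : 0;
}

/* Builds a sorted ring with K_VIRTUAL points per machine. Caller frees. */
struct ring_point *build_ring(uint32_t machines, size_t *n_points) {
    *n_points = (size_t)machines * K_VIRTUAL;
    struct ring_point *ring = malloc(*n_points * sizeof *ring);
    for (uint32_t m = 0; m < machines; m++)
        for (uint32_t v = 0; v < K_VIRTUAL; v++)
            ring[m * K_VIRTUAL + v] =
                (struct ring_point){ mix64(((uint64_t)m << 32) | v), m };
    qsort(ring, *n_points, sizeof *ring, cmp_point);
    return ring;
}

/* Maps a region identifier to the machine that stores it. */
uint32_t lookup_region(const struct ring_point *ring, size_t n_points,
                       uint64_t region_id) {
    uint64_t h = mix64(region_id);
    for (size_t i = 0; i < n_points; i++)   /* first point at or after h */
        if (ring[i].hash >= h)
            return ring[i].machine;
    return ring[0].machine;                 /* otherwise wrap around */
}
```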

Transactions in the Shared Address Space
- Strong consistency
- Atomic execution of multiple operations
[Figure: read, write, alloc, and free operations on the shared address space]

Communication Primitives
- One-sided RDMA reads: used to access data directly
- RDMA writes: a circular buffer is used as a unidirectional channel; one buffer for each sender/receiver pair, stored on the receiver
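A schematic of the receiver side of such an RDMA-write message channel, under my own assumptions about the slot layout (length-prefixed messages, no wrap-around mid-message); this is not FaRM's wire format. The buffer lives in the receiver's registered memory, the sender RDMA-writes each message into the next free slot, and the receiver polls the length word and zeroes consumed bytes so the slot reads as empty after the ring wraps.

```c
#include <stdint.h>
#include <string.h>

#define RING_BYTES 4096   /* illustrative; one ring per sender/receiver pair */

/* Receiver-side view of the circular buffer that the sender fills with
 * RDMA writes. Each message is a 4-byte length followed by the payload;
 * for simplicity this sketch assumes a message never wraps the ring end. */
struct msg_ring {
    uint8_t  buf[RING_BYTES];   /* registered with the RNIC */
    uint32_t head;              /* next offset the receiver will inspect */
};

/* Polls for the next message; returns the payload length, or -1 if none. */
int ring_poll(struct msg_ring *r, uint8_t *out, uint32_t max_len)
{
    uint32_t len;
    memcpy(&len, r->buf + r->head, sizeof len);
    if (len == 0 || len > max_len)
        return -1;                       /* nothing (valid) here yet */

    memcpy(out, r->buf + r->head + sizeof len, len);

    /* Zero the consumed region and advance the head. */
    memset(r->buf + r->head, 0, sizeof len + len);
    r->head = (r->head + sizeof len + len) % RING_BYTES;
    return (int)len;
}
```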

Benchmark on Communication Primitives
[Figure: benchmark results]

Limited Cache Space in the NIC
- Some Hadoop clusters have hundreds or thousands of nodes
- RDMA performance can suffer as the amount of registered memory increases: the NIC runs out of space to cache all the page tables

Limited Cache Space in the NIC
- FaRM's solution: PhyCo
  - A kernel driver allocates a large number of physically contiguous and naturally aligned 2 GB memory regions at boot time
  - Each region is mapped into the virtual address space aligned on a 2 GB boundary

PhyCo
[Figure]

Limited Cache Space in the NIC
- PhyCo still suffered as the number of machines in the cluster increased, because the NIC can also run out of space to cache all queue pair state
- One connection between every pair of threads: 2 × m × t^2 queue pairs per machine (m = number of machines, t = number of threads per machine)
- A single connection between a thread and each remote machine: 2 × m × t
- Queue pair sharing among q threads: 2 × m × t / q
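For example (an illustrative calculation, not a figure from the paper): with m = 100 machines and t = 40 threads per machine, per-thread-pair connections need 2 × 100 × 40^2 = 320,000 queue pairs per machine; one connection per thread per remote machine needs 2 × 100 × 40 = 8,000; and sharing each connection among q = 4 threads brings this down to 2,000.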

Connection Multiplexing
- The best value of q depends on the cluster size

Experiments
- Key-value store: lookup scalability
[Figure]

Experiments
- Key-value store: varying update rates
[Figure]