NaNet: A Custom NIC for Low-Latency, Real-Time GPU Stream Processing
Alessandro Lonardo, Andrea Biagioni (INFN Roma 1)

NaNet Problem: lower communication latency and its fluctuations. How?
- Inject data directly from the NIC into GPU memory, with no intermediate buffering.
- Offload network protocol stack management from the CPU, avoiding OS jitter effects.
NaNet solution:
- Use the APEnet+ FPGA-based NIC, which implements GPUDirect RDMA.
- Add a network protocol stack offloading engine (UDP Offload Engine) to the FPGA logic.
(A sketch of the resulting application-side receive flow is given below.)
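To make the data path concrete, here is a minimal sketch of what the receive side could look like from the application's point of view under this model. Only the CUDA calls are real; the nanet_* functions are assumed placeholders for the NaNet user-space API, which is not shown in the slides.

    /* Minimal sketch, not the actual NaNet API. */
    #include <cuda_runtime.h>
    #include <stddef.h>

    /* Assumed placeholders for the NaNet user-space API. */
    extern int nanet_register_gpu_buffer(void *gpu_ptr, size_t size);
    extern int nanet_wait_buffer(int *buf_index);  /* blocks until the NIC fills a buffer */

    int main(void)
    {
        const size_t BUF_SIZE = 8 * 1472;   /* room for 8 max-size UDP payloads */
        void *gpu_buf;

        /* The GPU receive buffer is written by the NIC via GPUDirect RDMA:
           no host bounce buffer and no cudaMemcpy on the critical path. */
        cudaMalloc(&gpu_buf, BUF_SIZE);
        nanet_register_gpu_buffer(gpu_buf, BUF_SIZE);

        for (int i = 0; i < 1000; i++) {
            int idx;
            nanet_wait_buffer(&idx);        /* completion event from the NIC */
            /* launch the processing kernel directly on gpu_buf here */
        }

        cudaFree(gpu_buf);
        return 0;
    }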

APEnet+ 3D Torus Network:
- Scalable (today up to 32K nodes)
- Direct network: no external switches
APEnet+ Card:
- FPGA based (Altera EP4SGX290)
- PCI Express x16 slot, signaling capabilities for up to dual x8 Gen2 (peak 4+4 GB/s)
- Single I/O slot width, 4 torus links, 2-D (secondary piggy-back double slot width, 6 links, 3-D torus topology)
- Fully bidirectional torus links, 34 Gbps
- Industry-standard QSFP
- A DDR3 SODIMM bank
APEnet+ Logic:
- Torus link
- Router
- Network interface
- Nios II 32-bit microcontroller
- RDMA engine
- GPU I/O accelerator
- PCIe x8 Gen2 core
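As a sanity check on the quoted peak (my own back-of-the-envelope; it assumes the "4+4 GB/s" counts the two directions of one x8 Gen2 link with 8b/10b encoding):

\[
8~\text{lanes} \times 5~\mathrm{GT/s} \times \tfrac{8}{10}~\text{(8b/10b)} \times \tfrac{1~\mathrm{B}}{8~\mathrm{bit}} = 4~\mathrm{GB/s}~\text{per direction.}
\]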

APEnet+ GPUDirect P2P Support
- PCIe P2P protocol between NVIDIA Fermi/Kepler GPUs and APEnet+: first non-NVIDIA device supporting it (2012), joint development with NVIDIA.
- The APEnet+ board acts as a peer: no bounce buffers on the host, and APEnet+ can target GPU memory with no CPU involvement.
- GPUDirect allows direct data exchange on the PCIe bus.
- Real zero copy, inter-node GPU-to-host, host-to-GPU and GPU-to-GPU.
- Latency reduction for small messages.
(Figure: APEnet+ flow)

NaNet Architecture
NaNet is based on APEnet+.
Multiple link technologies supported:
- apelink, bidirectional, 34 Gbps
- 1 GbE
- 10 GbE SFP+
New features:
- UDP offload: extracts the payload from UDP packets.
- NaNet Controller: encapsulates the UDP payload in a newly forged APEnet+ packet and sends it to the RX Network Interface logic (see the packet sketch below).
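A rough C model of the encapsulation step, to fix ideas; the header/footer fields below are illustrative and do not reproduce the real APEnet+ bit-level format.

    #include <stdint.h>

    /* Illustrative only: the real APEnet+ header/footer layout differs. */
    typedef struct {
        uint32_t dest_coord;      /* destination node / channel            */
        uint32_t target_addr_lo;  /* destination memory address, low word  */
        uint32_t target_addr_hi;  /* destination memory address, high word */
        uint32_t payload_len;     /* UDP payload length in bytes           */
    } apenet_header_t;

    typedef struct {
        uint32_t crc;             /* integrity check over the payload */
    } apenet_footer_t;

    typedef struct {
        apenet_header_t hdr;      /* forged by the NaNet Controller        */
        const uint8_t  *payload;  /* UDP payload extracted by the offload  */
        apenet_footer_t ftr;
    } apenet_packet_t;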

NaNet Implementation – 1 GbE
(Block diagram: PCIe x8 Gen2 core, network interface TX/RX block, 32-bit microcontroller, UDP offload, NaNet CTRL, memory controller, GPU I/O accelerator, on-board memory, 1 GbE port)
- Implemented on the Altera Stratix IV dev board; PHY: Marvell 88E1111.
- HAL-based Nios II microcontroller firmware (no OS), handling:
  - configuration,
  - GbE driving,
  - GPU destination memory buffer management (see the descriptor sketch below).
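One plausible shape for the GPU destination buffer bookkeeping done by the firmware; this is a guess at the data structure, not taken from the NaNet sources.

    #include <stdint.h>

    /* Hypothetical descriptor for one registered GPU receive buffer. */
    typedef struct {
        uint64_t gpu_bus_addr;   /* address the NIC writes to via P2P     */
        uint32_t size;           /* buffer size in bytes                  */
        uint32_t bytes_filled;   /* current fill level                    */
        uint8_t  owned_by_nic;   /* 1: NIC may write, 0: host/GPU owns it */
    } gpu_rx_buf_desc_t;

    /* The firmware would keep a small ring of these and hand the next
       free descriptor to the NaNet CTRL as buffers complete. */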

NaNet Implementation – 10 GbE (work in progress)
(Block diagram as above, with a 10 GbE port)
- Implemented on the Altera Stratix IV dev board + Terasic HSMC Dual XAUI to SFP+ daughtercard.
- Broadcom BCM8727: a dual-channel 10 GbE SFI-to-XAUI transceiver.

NaNet Implementation – Apelink
- Altera Stratix IV dev board + 3-link daughtercard
- APEnet+ 4-link board (+ 2-link daughtercard)

NaNet UDP Offload
Based on the Nios II UDP Offload open IP: http://www.alterawiki.com/wiki/Nios_II_UDP_Offload_Example
- It implements a method for offloading UDP packet traffic from the Nios II.
- It collects data coming from the Avalon Streaming Interface of the 1/10 GbE MAC.
- It redirects UDP packets into a hardware processing data path.
- The current implementation provides a single 32-bit wide channel at 6.4 Gbps (a 64-bit / 12.8 Gbps version for 10 GbE is in the works); see the throughput check below.
- The output of the UDP Offload is the PRBS packet (Size + Payload).
- Ported to the Altera Stratix IV EP4SGX230 (the original project targeted the Stratix II 2SGX90), with the clock raised to 200 MHz (from 35 MHz).
(Block diagram as above, with a 1/10 GbE port)
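The quoted channel rates follow directly from the 200 MHz clock, assuming one word is transferred per clock cycle:

\[
32\,\mathrm{bit} \times 200\,\mathrm{MHz} = 6.4\,\mathrm{Gbps},
\qquad
64\,\mathrm{bit} \times 200\,\mathrm{MHz} = 12.8\,\mathrm{Gbps}.
\]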

NaNet Controller
- It manages the 1/10 GbE flow by encapsulating packets in the APEnet+ packet protocol (Header, Payload, Footer).
- It implements an Avalon Streaming Interface.
- It generates the Header for the incoming data by analyzing the PRBS packet and several configuration registers.
- It parallelizes 32-bit data words coming from the Nios II subsystem into 128-bit APEnet+ data words (see the sketch below).
- It redirects data packets towards the corresponding FIFO (one for the Header/Footer, another for the Payload).
(Block diagram as above, with a 1/10 GbE port)
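A behavioral C model of the 32-to-128-bit packing (a software analogue, not the FPGA implementation): four consecutive 32-bit words from the UDP offload path are gathered into one 128-bit APEnet+ payload word.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t w[4]; } apenet_word128_t;

    /* Packs n_in 32-bit words into 128-bit words; returns how many
       128-bit words were produced. */
    static size_t pack_32_to_128(const uint32_t *in, size_t n_in,
                                 apenet_word128_t *out)
    {
        size_t n_out = 0;
        for (size_t i = 0; i + 4 <= n_in; i += 4, n_out++)
            for (int j = 0; j < 4; j++)
                out[n_out].w[j] = in[i + j];
        /* A real gearbox also pads and emits the trailing partial word;
           omitted here for brevity. */
        return n_out;
    }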

NaNet 1 GbE Test & Benchmark Setup
- Supermicro SuperServer 6016GT-TF:
  - X8DTG-DF motherboard (Intel Tylersburg chipset)
  - dual Intel Xeon E5620
  - Intel 82576 Gigabit Network Connection
  - NVIDIA Fermi S2050
  - CentOS 5.9, CUDA 4.2, NVIDIA driver 310.19
- NaNet board in a x16 PCIe 2.0 slot.
- NaNet GbE interface directly connected to one host GbE interface.
- Common time reference between sender and receiver (they are the same host!), which also eases data integrity tests.

NaNet 1 GbE Latency Benchmark
A single host program:
- Allocates and registers (pins) N GPU receive buffers of size P x 1472 bytes (1472 bytes being the maximum UDP payload size).
- In the main loop:
  - reads TSC cycles (cycles_bs);
  - sends P UDP packets with 1472-byte payload over the host GbE interface;
  - waits (no timeout) for a received-buffer event;
  - reads TSC cycles (cycles_ar);
  - records latency_cycles = cycles_ar - cycles_bs.
- Dumps the recorded latencies.
(A sketch of this loop is given below.)
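A minimal sketch of the measurement loop just described. The TSC reads and the UDP send are standard; nanet_wait_buffer() stands in for the NaNet receive-buffer event API, whose real name is not given in the slides.

    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <x86intrin.h>                 /* __rdtsc() */

    #define PAYLOAD 1472                   /* max UDP payload size */

    extern int nanet_wait_buffer(int *buf_index);   /* assumed NaNet API */

    void latency_loop(int sock, const struct sockaddr *dst, socklen_t dlen,
                      int iters, int P, uint64_t *lat_cycles)
    {
        char pkt[PAYLOAD];
        memset(pkt, 0xA5, sizeof pkt);

        for (int i = 0; i < iters; i++) {
            uint64_t cycles_bs = __rdtsc();            /* before send */
            for (int p = 0; p < P; p++)
                sendto(sock, pkt, sizeof pkt, 0, dst, dlen);
            int idx;
            nanet_wait_buffer(&idx);    /* received-buffer event, no timeout */
            uint64_t cycles_ar = __rdtsc();            /* after receive */
            lat_cycles[i] = cycles_ar - cycles_bs;
        }
    }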

NaNet 1 GbE Latency Benchmark (results plot)

NaNet 1 GbE Latency Benchmark (results plot, continued)

NaNet 1 GbE Bandwidth Benchmark (results plot)

Apelink Latency Benchmark
- The latency is estimated as half the round-trip time in a ping-pong test (see the formula below).
- ~8-10 µs on the GPU-GPU test: world record for GPU-GPU.
- The NaNet case corresponds to the H-G (host-to-GPU) curve.
- APEnet+ G-G latency is lower than InfiniBand up to 128 KB:
  - APEnet+ P2P latency ~8.2 µs
  - APEnet+ staging latency ~16.8 µs
  - MVAPICH/IB latency ~17.4 µs
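Writing out the estimator (standard ping-pong convention; averaging over N iterations is my assumption about how the reported figures were obtained):

\[
\text{latency} \;\approx\; \tfrac{1}{2}\,\overline{\mathrm{RTT}}
\;=\; \frac{1}{2N}\sum_{i=1}^{N}\bigl(t_i^{\mathrm{recv}} - t_i^{\mathrm{send}}\bigr).
\]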

Apelink Bandwidth Benchmark
Virtual-to-physical address translation implemented in the µC:
- Host RX ~1.6 GB/s
- GPU RX ~1.4 GB/s (switching the GPU P2P window before writing)
- Limited by the RX processing.
- GPU TX curves: P2P read protocol overhead.
Virtual-to-physical address translation implemented in HW:
- Max bandwidth ~2.2 GB/s (physical link limit; loopback test ~2.4 GB/s)
- Strong impact on FPGA memory resources!
- For now limited up to 128 KB.
- First implementation: host RX only!

THANK YOU