Slide 1 – NaNet: A Custom NIC for Low-Latency, Real-Time GPU Stream Processing
Alessandro Lonardo, Andrea Biagioni – INFN Roma 1
Slide 2 – GPU L0 RICH Trigger
■ Very promising results in increasing the selection efficiency for interesting events by integrating GPUs into the central L0 trigger processor, exploiting their computing power to implement more complex trigger primitives.
■ Requirements for the RICH detector L0 trigger:
 – Throughput: event data rate of 600 MB/s
 – lat(L0-GPU) = lat(proc) + lat(comm) < 1 ms (worked out after this slide)
■ Processing is not an issue. For RICH single-ring fitting, processing a 1000-event buffer on a Kepler GTX 680 gives:
 – Throughput of 2.6 GB/s
 – Latency of 60 μs
 (data from "Real-Time Use of GPUs in NA62 Experiment", CERN-PH-EP-2012-260)
■ The real challenge is to implement a RO Board – L0 GPU link with:
 – Sustained bandwidth > 600 MB/s (RO board output on GbE links)
 – Small and stable latency
[Diagram: RO board → L0 GPU → L0TP; 10 MHz, 1 MHz; max 1 ms latency]
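Working the budget out explicitly (a back-of-the-envelope step using the 60 μs Kepler processing figure from this slide; the slide itself does not spell it out):

```latex
\[
\mathrm{lat}_{L0\text{-}GPU} = \mathrm{lat}_{proc} + \mathrm{lat}_{comm} < 1~\mathrm{ms}
\quad\Longrightarrow\quad
\mathrm{lat}_{comm} \lesssim 1~\mathrm{ms} - 60~\mu\mathrm{s} \approx 940~\mu\mathrm{s}
\]
```

In other words, nearly the whole millisecond budget is available for communication, which is why the link latency, not the ring fit itself, is the critical quantity.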
Slide 3 – Processing Latency
■ Processing (on buffers of > 1000 events) takes 50 ns per event (a quick cross-check follows this slide).
■ lat(proc) is quite stable (< 200 μs), once the data is available to be processed!
■ Consolidated results on the C1060; Fermi and Kepler perform far better.
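A quick cross-check of these numbers (my arithmetic, not on the slide):

```latex
\[
1000~\text{events} \times 50~\mathrm{ns/event} = 50~\mu\mathrm{s}~\text{per buffer,}
\]
```

consistent with the ~60 μs quoted for a 1000-event buffer on the Kepler GTX 680 on the previous slide.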
Slide 4 – Communication Latency
■ lat(comm): the time needed to copy event data from the L0-GPU receiving GbE MAC to GPU memory.
■ Standard NIC data flow (sketched after this slide):
 1. The NIC receives incoming packets; data are written to a CPU memory buffer (kernel-driver network stack protocol handling).
 2. The CPU writes the data to GPU memory (application-issued cudaMemcpyHostToDevice).
[Timeline figure: start, copy data from CPU to GPU, processing time, copy results from GPU to CPU, end; 1000 events per packet]
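A minimal host-side sketch of this standard two-hop path, in C with the CUDA runtime API. The socket handling, the 64 B event size and the buffer names are illustrative assumptions, not the actual NA62 code:

```c
#include <cuda_runtime.h>
#include <sys/socket.h>
#include <stdlib.h>

#define EVT_BUF_SIZE (1000 * 64)   /* hypothetical: 1000 events, 64 B each */

void receive_and_upload(int udp_sock, char *d_events /* device pointer */)
{
    /* Step 1: the kernel network stack delivers the packet payload
       into a host (CPU) memory buffer.                              */
    char *h_events = malloc(EVT_BUF_SIZE);
    recv(udp_sock, h_events, EVT_BUF_SIZE, 0);

    /* Step 2: the application copies the host buffer to GPU memory.
       Both steps add latency and OS jitter on the critical path.    */
    cudaMemcpy(d_events, h_events, EVT_BUF_SIZE, cudaMemcpyHostToDevice);

    free(h_events);
}
```

Every received buffer pays for the kernel network stack (step 1) and an explicit host-to-device copy (step 2); NaNet's aim is to remove both from the critical path.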
Slide 5 – Communication Latency (2)
1. Host-to-GPU memory data transfer latency for a 1000-event buffer: O(100) μs (a measurement sketch follows this slide).
2. Time spent in the Linux kernel network stack protocol handling for 64 B packet data transfers: O(10) μs.
Both are affected by significant fluctuations due to OS jitter.
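A sketch of how the host-to-GPU transfer latency of item 1 can be measured with CUDA events. The 1000 × 64 B buffer size is assumed, and this is a generic measurement pattern, not the benchmark code behind the numbers above:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t size = 1000 * 64;          /* hypothetical 1000-event buffer */
    char *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, size);  /* pinned host memory */
    cudaMalloc((void **)&d_buf, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   /* elapsed time in milliseconds */
    printf("H2D transfer: %.1f us\n", ms * 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```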
Slide 6 – NaNet Solution
Problem: lower the communication latency and its fluctuations. How?
1. Offload the CPU from network stack protocol management.
2. Inject data directly from the NIC into the GPU(s) memory.
NaNet solution: re-use the APEnet+ FPGA-based NIC, which already implements (2), adding a network stack protocol management offloading engine (UDP Offload Engine) to the logic.
Slide 7 – APEnet+
3D Torus Network:
■ Scalable (today up to 32K nodes)
■ Direct network: no external switches
APEnet+ Card:
■ FPGA based (Altera EP4SGX290)
■ PCI Express x16 slot, signaling capabilities for up to dual x8 Gen2 (peak 4+4 GB/s)
■ Single I/O slot width, 4 torus links, 2-d torus topology (secondary piggy-back card: double slot width, 6 links, 3-d torus topology)
■ Fully bidirectional torus links, 34 Gbps
■ Industry-standard QSFP
■ A DDR3 SODIMM bank
APEnet+ Logic:
■ Torus Link
■ Router
■ Network Interface: NIOS II 32-bit microcontroller, RDMA engine, GPU I/O accelerator
■ PCIe x8 Gen2 core
Slide 8 – APEnet+ GPUDirect P2P Support
[Figure: APEnet+ data flow]
■ P2P between Nvidia Fermi and APEnet+: first non-Nvidia device supporting it! Joint development with Nvidia; the APEnet+ board acts as a peer (a registration sketch follows this slide).
■ No bounce buffers on the host: APEnet+ can target GPU memory with no CPU involvement.
■ GPUDirect allows direct data exchange on the PCIe bus.
■ Real zero copy, inter-node GPU-to-host, host-to-GPU and GPU-to-GPU.
■ Latency reduction for small messages.
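To illustrate what "the APEnet+ board acts as a peer" means on the host side, here is a hypothetical registration flow. `apenet_register_buffer()` and `apenet_publish_address()` are invented names standing in for the real APEnet+ RDMA API, which is not shown in the talk:

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical APEnet+-style RDMA API (names invented for illustration). */
int  apenet_register_buffer(void *addr, size_t size, int is_gpu_memory);
void apenet_publish_address(void *addr);   /* tell the remote peer where to PUT */

void setup_gpu_receive_buffer(size_t size)
{
    void *d_buf = NULL;
    cudaMalloc(&d_buf, size);              /* buffer lives in GPU memory */

    /* Register the GPU buffer with the NIC: from now on the NIC can write
       received data into it over PCIe (GPUDirect P2P), with no host bounce
       buffer and no CPU copy on the critical path.                         */
    apenet_register_buffer(d_buf, size, 1 /* GPU memory */);
    apenet_publish_address(d_buf);
}
```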
Slide 9 – Overview of the NaNet Implementation
[Block diagram: PCIe x8 Gen2 core (8@5 Gbps), TX/RX block, 32-bit microcontroller, UDP offload + NaNet Ctrl, memory controller, GPU I/O accelerator, on-board memory, 1 GbE port]
■ Stripped-down APEnet+ logic, (logically) eliminating the torus and router blocks
■ UDP offloading engine
■ HAL-based microcontroller firmware (essentially used for configuration only)
■ Implemented on the Altera Stratix IV development system
Slide 10 – APEnet+
3D Torus Network:
■ Scalable (today up to 32K nodes)
■ Cost effective: no external switches
APEnet+ Card:
■ FPGA based (Altera EP4SGX290)
■ PCI Express x16 slot, signaling capabilities for up to dual x8 Gen2 (peak 4+4 GB/s)
■ Single I/O slot width, 4 torus links, 2-d torus topology; a secondary piggy-back card results in a double slot width, 6 links, 3-d torus topology
■ Fully bidirectional torus links, 34 Gbps aggregated raw bandwidth (408 Gbps total switching capacity…)
■ Industry-standard QSFP+ (Quad Small Form-factor Pluggable) for high-density applications on copper as well as optical media (4×10 Gbps lanes per interface)
■ A DDR3 SODIMM bank
Slide 11 – APEnet+ Core: DNP
[Block diagram: DNP with 7×7-port router (routing logic, arbiter), six bidirectional torus links (X±, Y±, Z±), PCIe x8 Gen2 core (8@5 Gbps), TX/RX block, 32-bit microcontroller, collective communication block, memory controller, GPU I/O accelerator, on-board memory, 1 GbE port]
APEnet is based on the DNP:
■ RDMA: zero-copy RX & TX!
■ Small latency and high bandwidth
■ GPU cluster features (APEnet+): RDMA support for GPUs! No buffer copies between GPU and host; very good GPU-to-GPU latency (direct GPU interface, Nvidia P2P).
■ SystemC models, VHDL (synthesizable) code, AMBA interface (SHAPES), PCI Express interface (APEnet+).
■ Implementation on FPGA and "almost" tape-out on ASIC.
The HW block structure is split into:
■ Network Interface:
 – TX: gathers data coming in from the PCIe port, fragmenting the data stream into packets forwarded to the relevant destination port.
 – RX: RDMA (Remote Direct Memory Access) capabilities, PUT and GET, implemented at the firmware level.
 – The microcontroller (NIOS II) simplifies the DNP-core HW and the host-side driver. It manages the RDMA LUT allocated in the on-board memory (a sketch of such a table follows this slide): it adds/deletes entries on buffer register/unregister operations, and retrieves the entry needed to satisfy buffer-info requests for incoming DNP PUT/GET operands.
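A minimal sketch of such an RDMA LUT in C, to make the microcontroller's role concrete. The entry layout, table size and field names are assumptions for illustration, not the actual NIOS II firmware structures:

```c
#include <stdint.h>

/* One registered buffer: virtual address range plus where it really lives. */
typedef struct {
    uint64_t virt_addr;   /* address the remote peer targets in PUT/GET      */
    uint64_t size;        /* registered length in bytes                      */
    uint64_t bus_addr;    /* PCIe bus address (host RAM or GPU BAR window)   */
    uint8_t  is_gpu;      /* 1 = GPU memory, 0 = host memory                 */
    uint8_t  valid;
} rdma_lut_entry_t;

#define RDMA_LUT_SIZE 256
static rdma_lut_entry_t lut[RDMA_LUT_SIZE];   /* lives in on-board memory */

/* Add an entry when the host registers a buffer. Returns 0 on success. */
int lut_add(uint64_t virt, uint64_t size, uint64_t bus, uint8_t is_gpu)
{
    for (int i = 0; i < RDMA_LUT_SIZE; i++) {
        if (!lut[i].valid) {
            lut[i] = (rdma_lut_entry_t){ virt, size, bus, is_gpu, 1 };
            return 0;
        }
    }
    return -1;   /* table full */
}

/* Look up the entry covering the target address of an incoming PUT/GET. */
const rdma_lut_entry_t *lut_find(uint64_t virt, uint64_t len)
{
    for (int i = 0; i < RDMA_LUT_SIZE; i++) {
        if (lut[i].valid &&
            virt >= lut[i].virt_addr &&
            virt + len <= lut[i].virt_addr + lut[i].size)
            return &lut[i];
    }
    return NULL;   /* unregistered target: reject the operation */
}
```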
Slide 12 – APEnet+ Core: DNP (2)
[Same DNP block diagram as the previous slide]
■ ROUTER:
 – Dimension-order routing policy to implement communications across the switch ports (see the sketch after this slide).
 – The router allocates and grants the proper path; the arbiter manages conflicts between packets.
■ MULTIPLE TORUS LINKS:
 – Packet-based direct network, 2-d/3-d torus topology.
 – Bidirectional Ser/Des with 8b10b encoding for DC balance, de-skewing technology and CDR.
 – APEnet+ packets are encapsulated in a light, low-level, word-stuffing protocol with a fixed-size header/footer envelope.
 – Error detection via EDAC/CRC at the packet level.
 – Virtual channels and flow-control logic guarantee deadlock-free transmission and enhance fault tolerance.
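For reference, a schematic dimension-order routing decision on a 3-d torus, in C. The port encoding is invented and the real router implements this in hardware (VHDL), so this is purely illustrative:

```c
/* Illustrative dimension-order (X, then Y, then Z) routing on a 3-d torus. */
enum port { X_PLUS, X_MINUS, Y_PLUS, Y_MINUS, Z_PLUS, Z_MINUS, LOCAL };

/* Choose the +/- direction giving the shorter path around a ring of size n. */
static enum port step(int cur, int dst, int n, enum port plus, enum port minus)
{
    int fwd = ((dst - cur) % n + n) % n;      /* hops going in the + direction */
    return (fwd <= n - fwd) ? plus : minus;
}

enum port route(const int cur[3], const int dst[3], const int dim[3])
{
    if (cur[0] != dst[0]) return step(cur[0], dst[0], dim[0], X_PLUS, X_MINUS);
    if (cur[1] != dst[1]) return step(cur[1], dst[1], dim[1], Y_PLUS, Y_MINUS);
    if (cur[2] != dst[2]) return step(cur[2], dst[2], dim[2], Z_PLUS, Z_MINUS);
    return LOCAL;                              /* packet has arrived */
}
```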
Slide 13 – Latency Benchmark
■ One-way point-to-point test involving two nodes (see the sketch after this slide):
 – Receiver node tasks: allocates a buffer in either host or GPU memory; registers it for RDMA; sends its address to the transmitter node; starts a loop waiting for N buffer-received events; ends by sending back an acknowledgement packet.
 – Transmitter node tasks: waits for an initialization packet containing the receiver node's buffer (virtual) memory address; writes that buffer N times in a loop with RDMA PUT; waits for a final ACK packet.
■ No small-message optimizations such as copying data into temporary buffers:
 – Reduced pipelining capability of the APEnet+ HW.
 – No large performance difference with respect to the round-trip test.
■ ~7–8 μs on the GPU–GPU test!
■ ~2x for GPU TX !! still …
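A schematic version of this one-way test in C. All the RDMA primitives below (`rdma_put()`, `wait_buffer_received()`, …) are invented stand-ins for the APEnet+ RDMA API, which is not part of the slides:

```c
#include <stdint.h>
#include <stddef.h>

/* Invented APEnet+-style RDMA primitives, for illustration only. */
void *alloc_buffer(size_t size, int on_gpu);             /* host or GPU memory  */
void  rdma_register(void *buf, size_t size);
void  send_init(uint64_t remote_addr);                   /* tell TX where to PUT */
uint64_t wait_init(void);
void  rdma_put(uint64_t remote_addr, const void *src, size_t size);
void  wait_buffer_received(void);                        /* completion event    */
void  send_ack(void);
void  wait_ack(void);

void receiver(size_t size, int on_gpu, int N)
{
    void *buf = alloc_buffer(size, on_gpu);
    rdma_register(buf, size);
    send_init((uint64_t)(uintptr_t)buf);
    for (int i = 0; i < N; i++)
        wait_buffer_received();        /* one event per incoming RDMA PUT */
    send_ack();
}

void transmitter(const void *payload, size_t size, int N)
{
    uint64_t remote = wait_init();     /* receiver's registered buffer address */
    for (int i = 0; i < N; i++)
        rdma_put(remote, payload, size);
    wait_ack();                        /* total time / N gives one-way latency */
}
```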
Slide 14 – Latency Benchmark: P2P Effects
■ No P2P = cudaMemcpy D2H/H2D on host bounce buffers (a sketch of this path follows this slide)
■ Buffers pinned with cuMemHostRegister
■ cuMemcpy() costs ~8–10 μs
■ MVAPICH2 tested on the same test system*
[Plot annotation: 2 × cuMemcpy()]
* http://mvapich.cse.ohio-state.edu/performance/mvapich2/inter_gpu.shtml
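A sketch of the non-P2P bounce-buffer path being compared in the plot. The slide pins the bounce buffer with the driver-API call cuMemHostRegister(); the runtime-API equivalent cudaHostRegister() is used below for brevity, and the data flow is an illustrative assumption:

```c
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

/* Non-P2P receive path: the NIC lands data in a pinned host bounce buffer,
 * then an explicit copy stages it to GPU memory.                            */
void bounce_buffer_path(void *d_dst, const void *nic_data, size_t size)
{
    void *h_bounce = malloc(size);
    cudaHostRegister(h_bounce, size, cudaHostRegisterDefault);  /* pin it */

    /* 1) NIC (or its driver) writes the message into the host bounce buffer. */
    memcpy(h_bounce, nic_data, size);

    /* 2) Extra hop: host-to-device copy, ~8-10 us each on this system.       */
    cudaMemcpy(d_dst, h_bounce, size, cudaMemcpyHostToDevice);

    cudaHostUnregister(h_bounce);
    free(h_bounce);
}
```

For an inter-node GPU-to-GPU transfer this cost is paid twice (the "2 × cuMemcpy()" annotation): once device-to-host on the sender and once host-to-device on the receiver.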
Slide 15 – APEnet+ vs. the Rest of the World
■ Below 32 KB: P2P wins
■ 32 KB – 128 KB: P2P shows its limits
■ Over 128 KB: pure bandwidth wins
Slide 16 – Bandwidth Benchmark
■ Preliminary results on Fermi.
■ Curves exhibit a plateau at large message sizes:
 – Host RX ~1.3 GB/s
 – GPU RX ~1.1 GB/s
 – Accelerate the buffer search performed by the μC.
■ GPU TX curves: P2P read protocol overhead.
Slide 17 – NaNet
[Block diagram: PCIe x8 Gen2 core (8@5 Gbps), TX/RX block, 32-bit microcontroller, UDP offload + NaNet Ctrl, memory controller, GPU I/O accelerator, on-board memory, 1 GbE port, router and torus links]
NaNet is based on APEnet+:
■ It maintains all the features of APEnet+.
■ Different card: Altera DevKit (with a smaller device, Stratix IV EP4SGX230).
■ Router and torus links are on board but are not used at the moment.
■ New feature: UDP offload and NaNet Controller.
Slide 18 – UDP Offload / NaNet Ctrl
■ Nios II UDP Offload (open IP): a collection of HW components that can be programmed by the Nios II to selectively redirect UDP packets arriving over the Altera TSE MAC into a HW processing path.
 – Ported to the Altera Stratix IV EP4SGX230 (the original project was based on the Stratix II 2SGX90).
 – clk @ 200 MHz (instead of 35 MHz).
 – The output of the UDP Offload is the PRBS packet, whose payload contains only the number of bytes indicated in the UDP header.
■ NaNet Ctrl:
 – Implements an Avalon streaming sink interface to collect data coming from the source interface of the UDP Offload.
 – Encapsulates the UDP payload into APEnet+ packets: 1 header + 1 footer (128-bit words), payload in 128-bit words, max size 4 KB (256 words). A sketch of this packet layout follows this slide.
[Data-path figure: TSE MAC → UDP Offload (SRC/SNK) → NaNet Ctrl, producing HEADER + PAYLOAD + FOOTER from the payload size]
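A C sketch of the resulting packet layout. The slide only fixes the 128-bit word granularity, the single header and footer words, and the 4 KB (256-word) payload limit; what the header and footer actually contain is not specified, so they are left as opaque placeholders here:

```c
#include <stdint.h>

/* 128-bit word: the granularity used by the NaNet Ctrl / APEnet+ data path. */
typedef struct { uint64_t lo, hi; } word128_t;

#define NANET_MAX_PAYLOAD_WORDS 256          /* 256 x 16 B = 4 KB max payload */

/* One APEnet+ packet as assembled by the NaNet Ctrl from a UDP payload:
 * 1 header word + up to 256 payload words + 1 footer word.
 * The header/footer fields (destination, length, CRC, ...) are not spelled
 * out on the slide; treat them as placeholders.                             */
typedef struct {
    word128_t header;                                    /* 1 x 128-bit word  */
    word128_t payload[NANET_MAX_PAYLOAD_WORDS];          /* up to 4 KB        */
    word128_t footer;                                    /* 1 x 128-bit word  */
} nanet_packet_t;

/* Number of 128-bit payload words needed for a UDP payload of 'bytes' bytes. */
static inline uint32_t payload_words(uint32_t bytes)
{
    return (bytes + sizeof(word128_t) - 1) / sizeof(word128_t);
}
```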
Slide 19 – Test
■ Benchmarking platform:
 – 1U, dual multi-core Intel server equipped with an APEnet+ card.
 – 1U, Nvidia S2075 system packing 4 Fermi-class GPUs (~4 TFlops).
■ UDP Offload and NaNet Ctrl test (see the sketch after this slide):
 – The host generates a data stream of 10^5 32-bit words (packet size is 4 KB).
 – The packets follow the standard path.
 – The Nios II reads the packets and checks whether the data correspond to those sent by the host.
■ Integration of UDP Offload and NaNet Ctrl into the Network Interface completed:
 – Debugging stage.
 – Latency measurements.
[Testbench figure: host ETH → TSE MAC → UDP Offload → NaNet Ctrl (header/payload/footer) → NIOS]
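A sketch of the kind of pattern generation and check this test implies. The slide does not say which data pattern the host generates, so an incrementing 32-bit counter is assumed here:

```c
#include <stdint.h>

#define TOTAL_WORDS   100000u            /* 10^5 32-bit words, as in the test */
#define WORDS_PER_PKT (4096u / 4u)       /* 4 KB packets -> 1024 words each   */

/* Host side: fill one packet's payload with a known, checkable pattern. */
void fill_packet(uint32_t *payload, uint32_t first_word_index)
{
    for (uint32_t i = 0; i < WORDS_PER_PKT; i++)
        payload[i] = first_word_index + i;
}

/* Nios II side: verify a received packet against the expected pattern.
 * Returns the number of mismatching words (0 = packet is correct).      */
uint32_t check_packet(const uint32_t *payload, uint32_t first_word_index)
{
    uint32_t errors = 0;
    for (uint32_t i = 0; i < WORDS_PER_PKT; i++)
        if (payload[i] != first_word_index + i)
            errors++;
    return errors;
}
```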
Slide 20 – Thank You
Alessandro Lonardo, Pier Stanislao Paolucci, Davide Rossetti, Piero Vicini, Roberto Ammendola, Andrea Biagioni, Ottorino Frezza, Francesca Lo Cicero, Francesco Simula, Laura Tosoratto
Slide 21 – Back-up Slides
Slide 22 – QUonG: GPU + 3D Network
The EURETILE HPC platform is based on the QUonG development. QUantum chromodynamics ON Gpu (QUonG) is a comprehensive initiative aiming to provide a hybrid, GPU-accelerated x86_64 cluster with a 3D toroidal mesh topology, able to scale up to 10^4–10^5 nodes, with bandwidths and latencies balanced for the requirements of modern LQCD codes.
■ Heterogeneous cluster: a PC mesh accelerated with high-end GPUs and interconnected via a 3-d torus network.
■ Tight integration between accelerators (GPUs) and a custom/reconfigurable network (DNP on FPGA), allowing latency reduction and computing-efficiency gains.
■ Communication via an optimized custom interconnect (APEnet+), with a standard software stack (MPI, OpenMP, …).
■ Optionally an augmented programming model (cuOS).
■ A community of researchers sharing codes and expertise (LQCD, GWA, bio-computing, laser-plasma interaction).
■ GPUs by Nvidia: solid HW and good SW. Collaboration with the Nvidia US development team to "integrate" the GPU with our network.
Slide 23 – EURETILE Platform
■ 2 parallel and synergic development lines:
 – High Performance Computing (HPC).
 – Virtual Emulation Platform.
■ Based on common and unifying key elements:
 – Benchmarks.
 – Common software tool-chain.
 – Simulation framework.
 – Fault-tolerant, brain-inspired network (i.e. the DNP) interfaced to custom ASIPs and/or commodity computing accelerators.
■ Scientific High Performance Computing platform, leveraging the INFN QUonG project:
 – Intel CPUs, networked through an interconnected mesh composed of PCIe boards hosting the DNP integrated in an FPGA.
 – Software-programmable accelerators in the form of ASIPs (developed using TARGET's ASIP design tool-suite) integrated in the FPGA.
 – INFN will also explore the addition of GPGPUs.
■ High-abstraction-level simulation platform (RWTH Aachen): based on RISC models provided by RWTH Aachen and TLM models of the INFN DNP.
Slide 24 – APEnet+ Board Production and Test
■ 4 APEnet+ boards produced during 2011.
■ 15 APEnet+ boards in 2Q/12 and 10 more to complete the QUonG rack for 4Q/12.
■ Preliminary technical tests performed by the manufacturer.
■ Deeper functional tests:
 – Clock generators: fixed-frequency oscillators measured with a digital oscilloscope; programmable clock (Si570) firmware has been validated.
 – JTAG chain: Stratix IV and MAX II (EPM2210), 64 MB flash memory, master controller EPM240 CPLD. Windows OK; complete functionality on Linux obtained by bypassing the EPM240 firmware.
 – PCIe: Altera Hard IP + PLDA IP; PLDA test-bench adapted and implemented. Successful.
 – Memory: SODIMM DDR3, with the FPGA acting as memory controller; NIOS + Qsys environment (read and write). Still in progress.
 – Ethernet: 2 Ethernet RJ45 connectors (1 main board + 1 daughter board); NIOS + Qsys environment. Still in progress.
 – Remote links: 6 links (4 main board + 2 daughter board); Altera Transceiver Toolkit (bit-error rate with random pattern) used to find the best parameters (400MHz / 32GB main board, 350MHz / 28GB daughter board).
Slide 25 – Benchmarking Platform
■ 3 slightly different servers:
 – SuperMicro motherboards.
 – CentOS 5.6/5.7/5.8 x86_64.
 – Dual Xeon 56xx.
 – 12 GB – 24 GB DDR3 memory.
 – Nvidia C2050/M2070 on x16 Gen2 slots.
■ Preliminary benchmarks:
 – Coded with the APEnet RDMA API, CUDA 4.1.
 – One-way point-to-point test involving two nodes.
 – Receiver node tasks: allocates a buffer in either host or GPU memory; registers it for RDMA; sends its address to the transmitter node; starts a loop waiting for N buffer-received events; ends by sending back an acknowledgement packet.
 – Transmitter node tasks: waits for an initialization packet containing the receiver node's buffer (virtual) memory address; writes that buffer N times in a loop with RDMA PUT; waits for a final ACK packet.
Slide 26 – QUonG Status and Near Future (from the EURETILE Review, 12-13 April 2012)
■ Deployment of the system in 2012:
 – 42U standard rack system.
 – 60/30 TFlops/rack in single/double precision.
 – 25 kW/rack (0.4 kW/TFlops).
 – 300 k€/rack (< 5 k€/TFlops).
■ Full-rack prototype construction:
 – 20 TFlops ready at 1Q/12.
 – Full rack ready at 4Q/12 … waiting for Kepler GPUs.
[Figure: rack element made of a 1U dual multi-core Intel server with APEnet+ card plus a 1U Nvidia S2075 system packing 4 Fermi-class GPUs (~4 TFlops)]
Slide 27 – QUonG Status and Near Future (2)
QUonG elementary mechanical assembly:
– Multi-core Intel servers (packed in two 1U rackable systems)
– S2090 Fermi GPU system (5 TFlops)
– 2 APEnet+ boards
42U rack system:
– 60 TFlops/rack peak
– 25 kW/rack (i.e. 0.4 kW/TFlops)
– 300 k€/rack (i.e. 5 k€/TFlops)
[Figure: cluster node with 1+ GPUs, APEnet+ card and 6 torus links]
The EURETILE HW platform demonstrator at the 2012 project review will be a stripped-down version of the QUonG elementary mechanical assembly, with 2 CPU systems with/without GPUs connected with APEnet+ boards, to demonstrate:
– a running prototype of the EURETILE HW platform;
– a preliminary implementation of the "fault awareness" hardware block (sensor registers and link error counter read, …).
[Figure: CPU + GPGPU nodes connected through GPU interfaces and APEnet+ boards]
Slide 28 – GPU Support: P2P
■ CUDA 4.0: uniform address space; GPUDirect 2.0, a.k.a. P2P, among up to 8 GPUs. CUDA 4.1: P2P protocol with alien (non-Nvidia) devices.
■ P2P between Nvidia Fermi and APEnet+: first non-Nvidia device to support it! Joint development with Nvidia; the APEnet+ card acts as a peer, doing I/O directly on GPU framebuffer memory.
■ Problems: working around current chipset bugs, exotic PCIe topologies, Sandy Bridge Xeon.
Slide 29 – P2P Advantages
P2P means:
■ Data exchange directly on the PCIe bus
■ No bounce buffers on the host
So:
■ Latency reduction for small messages
■ No host cache pollution for large messages
■ Freed GPU resources, e.g. for GPU-to-GPU memcpy
■ More room for computation/communication overlap
Slide 30 – DNP Technical Details
■ PCI Express interface: built on the Altera PCIe Hard IP + a commercial wrapper/multi-DMA engine (PLDA EZDMA2) with up to 8 independent and concurrent DMA engines.
■ High-speed serial link interface to the DNP core:
 – Multiple virtual channels to avoid deadlock.
 – 4 bonded independent serial links, each lane running at 8.5 Gb/s.
 – Each lane providing CDR, 8b10b encoding, de-skewing logic, …
■ Design of a hardware/firmware RDMA-supporting sub-system based on the Altera native μP (NIOS II).
■ Experimental direct interface for GPUs and/or custom integrated accelerators.
[Lane diagram: transmitter channel (byte serializer, 8b10b encoder, TX PMA serializer) and receiver channel (RX PMA deserializer, CDR, word aligner, 8b10b decoder, byte deserializer, byte ordering, deskew FIFO)]
Slide 31 – The End
THANK YOU