RDMA optimizations on top of 100 Gbps Ethernet for the upgraded data acquisition system of LHCb
Balazs Voneki, CERN/EP/LHCb Online group
TIPP2017, Beijing, 23.05.2017

LHCb Upgrade
Improve the detectors and electronics so that the experiment can run at higher instantaneous luminosity (target: 2 * 10^33 cm^-2 s^-1).
Increase the event rate from 1 MHz to 40 MHz, with event selection done entirely in software.
Key challenges: relatively large data chunks, and everything goes through the event builder network (RU = Readout Unit, BU = Builder Unit).
The Event Builder network and the EB-EFF (Event Filter Farm) network do not have to be the same technology.

Network technologies
Possible 100 Gbps solutions: Intel® Omni-Path (OPA), EDR InfiniBand, 100 Gbps Ethernet. Possible 200 Gbps solution: HDR InfiniBand.
Arguments for Ethernet: it is widely used; it is old, mature and well-tried; and OPA and InfiniBand are single-vendor technologies while Ethernet is multi-vendor. On the other hand, Ethernet is challenging at these speeds.

Iperf result charts (TCP and UDP)
Observations from the charts: high CPU load, and a major difference between vendors.

Linux network stack (diagram)
Source: http://www.linuxscrew.com/2007/08/13/linux-networking-stack-understanding/
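
To illustrate the path these iperf numbers exercise, here is a minimal TCP sender sketch; it assumes a receiver is already listening on a placeholder address and port, and it is not the benchmark code behind the charts above. Every send() call crosses the socket API, so the kernel copies the data and runs the full TCP/IP stack on the CPU, which is where the high CPU load comes from.

```c
/* Minimal sketch (assumed, not the talk's benchmark): stream a fixed-size
 * buffer over a TCP socket and report the achieved throughput.
 * SERVER_IP, PORT and the transfer size are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define SERVER_IP "192.168.1.10"      /* placeholder: receiver's address */
#define PORT      5001                /* placeholder port */
#define MSG_SIZE  (1 << 20)           /* 1 MiB per send() call */
#define TOTAL     (10L * (1L << 30))  /* stream roughly 10 GiB */

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(PORT);
    inet_pton(AF_INET, SERVER_IP, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect"); return 1;
    }

    char *buf = malloc(MSG_SIZE);
    memset(buf, 0xAB, MSG_SIZE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long sent = 0;
    while (sent < TOTAL) {
        /* Each send() copies the buffer into kernel socket memory and runs
         * the TCP/IP stack on the CPU -- the overhead RDMA avoids. */
        ssize_t n = send(fd, buf, MSG_SIZE, 0);
        if (n < 0) { perror("send"); break; }
        sent += n;
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f Gbit/s\n", 8.0 * sent / secs / 1e9);

    free(buf);
    close(fd);
    return 0;
}
```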

What is RDMA?
Remote Direct Memory Access: DMA from the memory of one node into the memory of another node without involving either node's operating system.
The transfer is performed by the network adapter itself; no work is needed from the CPUs, and there is no extra cache traffic and no context switching.
Benefits: high throughput and low latency, which are especially interesting for High Performance Computing.
Source: https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/
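
As a concrete illustration of this one-sided model, below is a minimal libibverbs sketch that posts a single RDMA WRITE and waits for its completion. It assumes the queue pair is already connected and that the peer's buffer address and rkey were exchanged out of band (e.g. over a TCP socket); the function name and parameters are placeholders, and this is not the code behind the measurements in this talk.

```c
/* Minimal sketch (assumed): one RDMA WRITE with libibverbs.
 * qp, cq and mr are expected to exist already; remote_addr/remote_rkey
 * describe a buffer registered on the peer. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

int rdma_write_once(struct ibv_qp *qp, struct ibv_cq *cq,
                    struct ibv_mr *mr, void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* local source buffer        */
        .length = len,
        .lkey   = mr->lkey,              /* from ibv_reg_mr()          */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE, /* one-sided operation        */
        .send_flags = IBV_SEND_SIGNALED, /* request a completion entry */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;

    struct ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr) != 0) {
        fprintf(stderr, "ibv_post_send failed\n");
        return -1;
    }

    /* Busy-poll the completion queue; a real event builder would batch
     * work requests and poll less aggressively. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "RDMA WRITE completion error\n");
        return -1;
    }
    return 0;
}
```

The target node's CPU never touches the transfer; only the initiating side consumes a completion entry, which is why the CPU load stays so low in the RDMA measurements.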

RDMA Technologies
Available solutions: RoCE (RDMA over Converged Ethernet) and iWARP (Internet Wide Area RDMA Protocol).
RoCE needs custom settings on the switch (priority queues to guarantee lossless L2 delivery); iWARP does not, only the NICs have to support it.
Tests were made using: Chelsio T62100-LP-CR NIC, Mellanox CX455A ConnectX-4 NIC, Mellanox SN2700 100G Ethernet switch.
RoCE (pronounced "Rocky") provides a seamless, low-overhead, scalable way to address the TCP/IP I/O bottleneck with minimal extra infrastructure. iWARP is an alternative RDMA offering that needs no custom switch settings because it is implemented on top of TCP. Both NICs are PCIe Gen 3.0 compliant.
Source: https://www.theregister.co.uk/2016/04/11/nvme_fabric_speed_messaging/

Testbed details
iWARP testbench: 4 nodes of Dell PowerEdge C6220, each with 2x Intel® Xeon® CPU E5-2670 at 2.60 GHz (8 cores, 16 threads), 32 GB DDR3 memory at 1333 MHz and a Chelsio T62100-LP-CR 100G NIC.
RoCE testbench: Intel® S2600KP nodes, each with 2x Intel® Xeon® CPU E5-2650 v3 at 2.30 GHz (10 cores, 20 threads), 64 GB DDR4 memory at 2134 MHz and a Mellanox CX455A ConnectX-4 100G NIC.

Result charts: iWARP and RoCE
The iWARP results with a single thread did not go beyond 90 Gbps; multiple threads were needed to get closer to the theoretical maximum of 100 Gbps. The RoCE tests showed good throughput even with one thread, so there was no need to test the multi-thread case. (Chart annotation: with one single thread, about 2.5% CPU for all message sizes.)

Result charts: UDP, TCP and RoCE
(Chart annotations for single-thread runs: 94.4 Gbps; 98 Gbps at 8.6% CPU; 98 Gbps at 18% CPU; 30 Gbps at 2.5% CPU; 17 Gbps at 2.5% CPU.)

MPI result charts with RoCE
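
The MPI charts come from point-to-point measurements; as an illustration of the pattern only (not the actual benchmark used here), a minimal two-rank ping-pong bandwidth test could look like the sketch below. Under an RDMA-capable MPI (for example Open MPI or MVAPICH2 built with verbs/UCX support), large messages travel over RoCE without going through the kernel TCP stack.

```c
/* Minimal sketch (assumed): MPI ping-pong bandwidth between two ranks.
 * Message size and iteration count are arbitrary placeholders. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int msg_size = 1 << 20;   /* 1 MiB messages */
    const int iters    = 1000;
    char *buf = malloc(msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    /* Each iteration moves msg_size bytes in each direction. */
    if (rank == 0)
        printf("avg bandwidth: %.2f Gbit/s\n",
               2.0 * iters * msg_size * 8.0 / elapsed / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

A hypothetical launch would be something like "mpirun -np 2 --host nodeA,nodeB ./pingpong", with the exact options depending on the MPI distribution and its RoCE/verbs configuration.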

Result charts 2 with RoCE
Purpose of the heat maps: to check stability.

Summary
Promising results. Pure TCP/IP is inefficient; a zero-copy approach is needed. We still need to understand why the bidirectional heat map is not homogeneous.

Thank you!