Niko Neufeld, CERN/PH. Many stimulating, fun discussions with my event-building friends in ALICE, ATLAS, CMS and LHCb, and with a lot of smart people in CERN/IT (openlab) and industry are gratefully acknowledged.
Prediction is very difficult, especially about the future. (attributed to Niels Bohr) I use what I have learned over the last 4 to 5 years from presentations, private talks and the net. I have tried to discard anything which I remember to be under NDA. If you are interested in something I did not say or write, come and see me afterwards. If you think I spilled any beans, don't tell anybody (but do tell me!).
I am not going to explain event-building, and in particular not the couple of thousand lines of C code which are often referred to as the "event-builder". This talk is about technology, architecture and cost.
It's a good time to do DAQ
Links, PCs and networks
PCs used to be relatively modest I/O performers (compared to FPGAs); this has changed radically with PCIe Gen3. The Xeon processor line now has 40 PCIe Gen3 lanes per socket, so a dual-socket system has a theoretical throughput of 1.2 Tbit/s(!). Tests suggest that we can get quite close to the theoretical limit (at least using RDMA). This is driven by the need for fast interfaces to co-processors (GPGPUs, Xeon Phi). For us (even in LHCb) the CPU will be the bottleneck in the server, not the LAN interconnect: 10 Gb/s is by far sufficient.
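A quick back-of-the-envelope check of that 1.2 Tbit/s figure (PCIe Gen3: 8 GT/s per lane with 128b/130b encoding; counting both directions of a dual-socket, 2 x 40 lane system):

\[
2 \times 40 \times 8\,\mathrm{GT/s} \times \frac{128}{130} \approx 630\,\mathrm{Gb/s}\ \text{per direction} \approx 1.26\,\mathrm{Tbit/s}\ \text{full duplex.}
\]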
More integration (Intel roadmap): all your I/O are belong to us. Integrate the memory controller, PCIe controller, GPU, NIC (3 to 4 years?), physical LAN interface (4 to 5 years)? Advertisement: we are going to study some of these in our common EU project ICE-DIP. Aim for high density: short distances(!), efficient cooling, less real estate (rack space) needed. The same applies, mutatis mutandis, to ARM (and other CPU architectures).
Ethernet, the champion of all classes in networking: 10, 40 and 100 Gb/s are out, and 400 Gb/s is in preparation. As speed goes up, so does power dissipation; high-density switches come with high integration. The market seems to separate more and more into two camps: carrier-class, deep-buffer, high-density, flexible firmware (FPGAs, network processors) (Cisco, Juniper, Brocade, Huawei), at $$$/port; and data-centre, shallow-buffer, ASIC-based, ultra-high density, focused on layer 2 and simple layer 3 features, very low latency, at $/port (these are often also called Top Of Rack, TOR).
Ethernet per-port cost (table, only partially recoverable):
Speed | Core [USD/port] | TOR [USD/port]
10 Gb/s | 400 | …
If possible, use fewer core ports and more TOR ports; buffering then needs to be done elsewhere.
Loss-less Ethernet (DCB) claims to give us what deep-buffer switches do, but without the price (why: because it is made for FCoE, which wants low latency). The basic idea is to avoid congestion and the need for buffering in the switch, pushing the buffering needs into the end-nodes (PCs). DCB is an umbrella term for a zoo of IEEE standards, among them: IEEE 802.1Qbb PFC (Priority-based Flow Control), IEEE 802.1Qaz ETS (Enhanced Transmission Selection), IEEE 802.1Qau CN (Congestion Notification). Currently only partially supported; in particular CN is not really working in practice and converged NICs are expensive, but this might change if FCoE takes off: keep an eye on this. It is not clear that this is more effective than explicit traffic shaping on a dedicated network.
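Since DCB acts on priority classes, the end-node has to tag its traffic so that PFC/ETS can do anything useful with it. A minimal sketch of that tagging on Linux, assuming the priority-to-traffic-class mapping is configured outside the program (lldpad/mqprio); priority 4 is an illustrative, hypothetical choice:

```c
/* Tag an event-builder data socket with a priority so that DCB (PFC/ETS)
 * capable NICs and switches can place it in its own traffic class.
 * Assumption: Linux; the priority-to-class mapping is set up elsewhere. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int open_eb_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return -1; }

    int prio = 4;  /* hypothetical priority reserved for event-building traffic */
    if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof prio) < 0)
        perror("setsockopt(SO_PRIORITY)");
    return fd;
}

int main(void)
{
    int fd = open_eb_socket();
    printf("event-builder socket fd = %d\n", fd);
    return fd < 0;
}
```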
InfiniBand is driven by a relatively small, agile company: Mellanox. It is essentially HPC plus some database and storage applications; the only competitor is Intel (ex-QLogic). It is extremely cost-effective in terms of Gbit/s per dollar. It is an open standard, but almost single-vendor, and it is unlikely that a small startup could enter. The software stack (OFED, including RDMA) is also supported by Ethernet (NIC) vendors. Many recent Mellanox products (from FDR onwards) are also compatible with Ethernet.
InfiniBand per-port cost (table, only partially recoverable):
Speed | Core [USD/port] | TOR [USD/port]
52 Gb/s | … | …
Worth some R&D at least.
InfiniBand is (almost) a single-vendor technology. Can the InfiniBand flow control cope with the very bursty traffic typical of a DAQ system, while preserving high bandwidth utilization? InfiniBand is mostly used in HPC, where hardware is regularly and radically refreshed (unlike IT systems like ours, which are refreshed and grown over time in an evolutionary way); is there not a big risk that the HPC guys jump onto a new technology and InfiniBand is gone? The InfiniBand software stack seems awfully complicated compared to the Berkeley socket API; what about tools, documentation and tutorial material to train engineers, students, etc.? (* Enter the name of your favorite non-favorite here!)
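On the last point, a small taste of what the verbs side of OFED looks like compared to Berkeley sockets: a minimal sketch (assuming libibverbs is installed; link with -libverbs) that just enumerates the InfiniBand devices in a node, before any queue pairs or RDMA come into play.

```c
/* List InfiniBand devices and a few of their capabilities via the
 * OFED verbs API; roughly the verbs-world equivalent of "ifconfig". */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no InfiniBand devices found\n"); return 1; }

    for (int i = 0; i < num; ++i) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx) continue;

        struct ibv_device_attr attr;
        if (ibv_query_device(ctx, &attr) == 0)
            printf("%s: %d port(s), up to %d queue pairs\n",
                   ibv_get_device_name(devs[i]),
                   attr.phys_port_cnt, attr.max_qp);
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```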
Adapter and bus roadmap (figure): PCIe Gen3 available; Mellanox FDR HCA and Mellanox 40 GbE NIC available; Chelsio T5 (40 GbE) and Intel 40 GbE expected; EDR (100 Gb/s) HCA expected; 100 GbE NIC expected; PCIe Gen4 expected.
All modern interconnects are multi-lane serial (n x something-SR). Another aspect of "Moore's" law is the increase of serialiser speed: a higher lane speed reduces the number of lanes (fibres). Cheap interconnects also require the availability of cheap optics (VCSELs, silicon photonics). VCSELs currently run better over MMF (OM3, OM4 for 10 Gb/s and above); per metre these fibres are more expensive than SMF. The current lane speed is 10 Gb/s (in the same class as 8 Gb/s and 14 Gb/s). The next lane speed (coming soon, and already available on high-end FPGAs) is 25 Gb/s (same class as 16 Gb/s); it should be safely established by 2017 (a hint for GBT v3?).
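To see how a higher lane speed cuts the fibre count, compare the two multi-mode 100 Gigabit Ethernet variants (standard IEEE 802.3 numbers):

\[
100\,\mathrm{GbE} = 10 \times 10\,\mathrm{Gb/s}\ (\text{SR10: } 20\ \text{fibres}) = 4 \times 25\,\mathrm{Gb/s}\ (\text{SR4: } 8\ \text{fibres}).
\]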
Large core switches (figure: port count vs. date of release):
Brocade MLX: … GigE ports
Juniper QFabric: up to … GigE ports (not a single-chassis solution)
Mellanox SX6536: 648 x 56 Gb/s (IB) / 40 GbE ports
Huawei CE12800: 288 x 40 GbE / 1152 x 10 GbE
Many ports do not mean that the interior is similar. Some of these are a high-density combination of TOR elements (Mellanox, Juniper QFabric), usually priced just so that you cannot do it cheaper by cabling up the pizza-boxes yourself. Some are classical fat cross-bars (Brocade). Some are in between (Huawei: Clos, but with lots of buffering).
Surprisingly (for some), copper cables remain the cheapest option, provided distances are short. A copper interconnect is even planned for InfiniBand EDR (100 Gbit/s). For 10 Gigabit Ethernet the verdict is not in yet, but I am convinced that we will get 10GBase-T on the main-board "for free". It is not yet clear if there will be (a need for) truly high-density 10GBase-T line-cards (100+ ports).
Multi-lane optics (Ethernet SR4, SR10, InfiniBand QDR) over multi-mode fibres are limited to 100 m (OM3) to 150 m (OM4). Cable assemblies ("direct-attach" cables) are either passive ("copper", "twinax"), very cheap and rather short (max. 4 to 5 m), or active: still cheaper than discrete optics, but since they use the same components internally they have similar range limitations. For comparison, the price ratio of a 40G QSFP+ copper cable assembly : 40G QSFP+ active cable : 2 x QSFP+ SR4 optics + fibre (30 m) is 1 : 8 : 10.
After event-building, all these events need to be processed by software triggers. In HPC ("Exascale") co-processors (Xeon Phi, GPGPUs) are currently the key technology to achieve this; there is also ARM, but that is for another talk. If they are used in the trigger, it is likely to be most efficient to include them in the event-building, i.e. receive data directly on the GPGPU rather than passing through the main CPU; this is supported today over InfiniBand by both Nvidia and Intel. The bottleneck is the interface bus, a main driver for more and faster PCIe lanes. Interestingly, Intel also seems to be developing the concept of making the "co-processor" an independent unit on the network; this will clearly require very high-speed network interfaces (>> 100 Gb/s to make sense compared to PCIe Gen3).
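A hedged sketch of what "receive directly on the GPGPU" amounts to on the Nvidia/Mellanox side: registering a CUDA buffer with the HCA so that RDMA transfers land in GPU memory without a detour through host RAM. This assumes Mellanox OFED with the GPUDirect RDMA peer-memory module loaded and the CUDA toolkit (link with -libverbs -lcudart); buffer size and device choice are illustrative.

```c
/* Register a GPU buffer with the InfiniBand HCA so RDMA can write into it. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "cannot open device\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    void *gpu_buf = NULL;
    size_t size = 1 << 20;                     /* 1 MiB event-fragment buffer */
    if (cudaMalloc(&gpu_buf, size) != cudaSuccess) { fprintf(stderr, "cudaMalloc failed\n"); return 1; }

    /* With GPUDirect RDMA the HCA can DMA straight into this GPU buffer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, size,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    printf("GPU buffer registered: %s\n", mr ? "yes" : "no (peer-memory module missing?)");

    if (mr) ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```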
There is always the temptation to remove the switch altogether and merge fabric and network: modern versions of an old idea (Token Ring, SCI). Examples: PCIe-based fabrics (for example VirtualShare Ronniee, a 2D torus based on PCIe which creates a large 64-bit shared-memory space over PCIe), and the IBM Blue Gene interconnect (11 x 16 Gb/s links integrated on the chip, building a 5D torus).
If we had an infinite number of students it would at least be cool to check this in simulation. In any case it will not scale gracefully for event-building, and to be economical it would need to be combined with a conventional LAN.
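A zeroth-order version of such a check, long before a real simulation: for uniform all-to-all event-building traffic on an n x n 2D torus with minimal routing, the average hop count grows with n, and every extra hop consumes bandwidth on an intermediate link. The sketch below (node count and routing are illustrative assumptions) just computes that number.

```c
/* Average hop count for all-to-all traffic on an n x n 2D torus,
 * assuming minimal (shortest-path) routing. */
#include <stdio.h>
#include <stdlib.h>

static int torus_dist(int a, int b, int n)   /* shortest distance on a ring of size n */
{
    int d = abs(a - b);
    return d < n - d ? d : n - d;
}

int main(void)
{
    const int n = 16;                        /* 256 nodes, an illustrative size */
    long hops = 0, pairs = 0;

    for (int sx = 0; sx < n; ++sx)
        for (int sy = 0; sy < n; ++sy)
            for (int dx = 0; dx < n; ++dx)
                for (int dy = 0; dy < n; ++dy) {
                    if (sx == dx && sy == dy) continue;
                    hops += torus_dist(sx, dx, n) + torus_dist(sy, dy, n);
                    ++pairs;
                }

    /* The average hop count is a rough scaling factor for per-link load. */
    printf("avg hops = %.2f over %ld source/destination pairs\n",
           (double)hops / pairs, pairs);
    return 0;
}
```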
Make the best of available technologies
Minimize the number of "core" network ports. Use the most efficient technology for a given connection: different technologies should be able to co-exist (e.g. fast for event-building, slower towards the end-nodes), and distances should be kept short. Exploit the economy of scale: try to do what everybody does (but smarter).
GBT: custom radiation-hard link over MMF, 3.2 Gbit/s (about 12000 links). Input into the DAQ network (10/40 Gigabit Ethernet or FDR IB): 1000 to 4000 links. Output from the DAQ network into the compute-unit clusters (100 Gigabit Ethernet / EDR IB): 200 to 400 links. (Figure: detector, 100 m of rock, Readout Units, DAQ network, Compute Units.) The long distance is covered by the low-speed links from the detector to the Readout Units: cheap, and these links are required anyhow; many MMF are needed (low speed).
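A quick consistency check of those numbers (ignoring protocol overhead and assuming the detector links are fully loaded):

\[
12000 \times 3.2\,\mathrm{Gb/s} \approx 38\,\mathrm{Tb/s}, \qquad
\frac{38\,\mathrm{Tb/s}}{40\,\mathrm{Gb/s}} \approx 1000, \quad
\frac{38\,\mathrm{Tb/s}}{10\,\mathrm{Gb/s}} \approx 4000, \qquad
\frac{38\,\mathrm{Tb/s}}{100\,\mathrm{Gb/s}} \approx 400,
\]

which matches the quoted 1000 to 4000 input links (40 GbE vs. 10 GbE) and 200 to 400 output links.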
About 1600 core ports
Pull, push and barrel-shifting can of course be done on any topology. It is rather the switch hardware (and there, in particular, the amount of buffer space) and the properties of the low-level (layer-2) protocol that make the difference.
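For concreteness, a minimal sketch of the barrel-shifting idea (source and destination counts are illustrative): in time slot t every source s sends its fragment to destination (s + t) mod N, so each destination receives from exactly one source at a time and the traffic is shaped by construction rather than by switch buffers.

```c
/* Print one full rotation of a barrel-shifter event-building schedule. */
#include <stdio.h>

int main(void)
{
    const int nsrc = 4, ndst = 4;            /* illustrative system size */

    for (int t = 0; t < ndst; ++t) {         /* one full barrel rotation */
        printf("slot %d:", t);
        for (int s = 0; s < nsrc; ++s)
            printf("  src%d->dst%d", s, (s + t) % ndst);
        printf("\n");
    }
    return 0;
}
```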
(Figure: proposed architecture; 10GBase-T towards the servers, 100 Gb/s bi-directional links in the core. These two are not necessarily the same technology.)
Technology is on our side: truly large DAQ systems can be built and afforded. Event-building is still an "unusual" problem, so we need to watch out for the best architecture to match what is "out there" to our needs. That architecture should maximize our options, not follow any crazy fashion, but be ready to exploit a new mainstream we cannot anticipate today.