“Heterogeneous computing for future triggering”

Presentation transcript:

“Heterogeneous computing for future triggering”
Streaming Readout Workshop, 27.1.2017, M.I.T., Boston
Gianluca Lamanna (INFN)

The problem
FCC (Future Circular Collider) is only an example: fixed target, flavour factories, … the physics reach will be defined by the trigger! What will the triggers look like in 2035?

The trigger in 2035…
…will be similar to the current trigger: high reduction factor, high efficiency for interesting events, fast decision, high resolution. …but it will also be different: the higher background and pile-up will limit the ability to trigger on interesting events, and the primitives will be more complicated than today's: tracks, clusters, rings.

The trigger in 2035…
Higher energy: resolution for high-pT leptons → high-precision primitives; high occupancy in the forward region → better granularity. Higher luminosity: track-calo correlation; bunch-crossing identification becomes challenging with pile-up. All of these effects go in the same direction: more resolution and more granularity mean more data and more processing.

Classic trigger in the future?
Is a traditional “pipelined” trigger possible? Yes and no.
- Cost and dimension: getting all the data in one place requires new links → data flow.
- No “slow” detectors can participate in the trigger (limited latency).
- Pre-processing on-detector could help, but FPGAs are not suitable for complicated processing.
Main limitation: generation of high-quality trigger primitives on the detector (processing).

Classic pipeline: processing
The performance of an FPGA as a computing device depends on the problem, and the increase in computing capability of “standard” FPGAs is not as fast as for CPUs. This scenario could change in the future with the introduction of new FPGA+CPU hybrid devices.

Triggerless?
Is it possible to bring all the data to PCs?
- LHCb: probably yes (in 2020): 40 MHz readout, 30 Tb/s data network, 4000 cores, 8800 links.
- CMS & ATLAS: probably not: 4000 Tb/s of data, 4M links, ×10 in switch performance, ×2000 in computing.
Main limitation: data transport.
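A back-of-the-envelope cross-check of the LHCb numbers above (my arithmetic, not from the slides): a 40 MHz readout feeding a 30 Tb/s network corresponds to an average event size of

\[
\frac{30\ \text{Tb/s}}{40\ \text{MHz}} = 7.5\times10^{5}\ \text{bit} \approx 94\ \text{kB per event}.
\]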

Triggerless: data links
Link bandwidth is steadily increasing, but the power consumption is not compatible with HEP purposes (rad-hard serializers): the lpGBT takes 500 mW per 5 Gb/s link, so 4M links → 2 MW just for the links on the detector. The mainstream market is not interested in this application.
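The 2 MW figure is just the link count times the serializer power:

\[
4\times10^{6}\ \text{links} \times 0.5\ \tfrac{\text{W}}{\text{link}} = 2\ \text{MW}.
\]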

An alternative approach
- Classic pipeline: focus on on-detector processing.
- Triggerless: focus on data links.
- High-latency trigger: focus on on-detector buffers, with heterogeneous computing, a toroidal network, a time-multiplexed trigger, and the trigger implemented in software.

High latency: memories
Memory prices decrease very fast compared with CPUs; Moore's law for memories predicts a ~10% increase per year.

Heterogeneous computing
Heterogeneous computing is a way to cheat Moore's law: computing on specialized units (CPU, GPU, DSP, CAM, embedded RISC processors, FPGA, …).

GPU vs CPU
GPUs are processors dedicated to parallel programming for graphics applications. Processors no longer get faster, just wider: SIMD/SIMT parallel architectures.
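A minimal CUDA sketch of the SIMT model (illustrative only, not from the talk): every thread executes the same instruction on a different data element, so throughput comes from width rather than clock speed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SIMT: every thread runs the same code on a different element.
__global__ void scale(const float* in, float* out, float gain, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = gain * in[i];   // one lane per data element
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // 1M elements processed by thousands of parallel threads.
    scale<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```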

[Figure: GPU vs CPU architecture comparison]

GPU in the low-level trigger
Computing power: is the GPU fast enough to take trigger decisions at event rates of tens of MHz? Latency: is the GPU latency per event small enough to cope with the tight latency budget of a low-level trigger system? Is the latency stable enough for use in synchronous trigger systems?

Low-level trigger: the NA62 test bench
Kaon decays in flight; high-intensity unseparated hadron beam (6% kaons); event-by-event kaon momentum measurement. Huge background from kaon decays: ~10^8 background with respect to signal. Requires good kinematic reconstruction and an efficient veto and PID system for backgrounds that are not kinematically constrained.

Low-level trigger: the NA62 test bench
RICH: 17 m long, 3 m in diameter, filled with Ne at 1 atm. Reconstructs Cherenkov rings to distinguish pions from muons between 15 and 35 GeV. Two spots of ~1000 PMs each; time resolution 70 ps; misidentification 5×10^-3; 10 MHz event rate with about 20 hits per particle.

The NA62 “standard” TDAQ system
[Diagram: detectors (RICH, MUV, CEDAR, LKR, STRAWS, LAV) → trigger primitives → L0TP; data at 10 MHz → 1 MHz → 100 kHz through L1/L2 PCs on a GigaEth switch → EB → CDR at O(kHz)]
- L0: hardware synchronous level, 10 MHz to 1 MHz, max latency 1 ms.
- L1: software level, “single detector”, 1 MHz to 100 kHz.
- L2: software level, “complete information”, 100 kHz to a few kHz.

Latency: the main problem of GPU computing
Total latency is dominated by the double copy in host RAM. Remedies: decrease the data-transfer time with DMA (Direct Memory Access) and custom management of the NIC buffers; “hide” part of the latency by optimizing multi-event computing.
[Diagram: NIC → DMA → GPU VRAM path over the PCI Express chipset, bypassing CPU RAM, using a custom driver]
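A sketch of the latency-hiding idea in plain CUDA (illustrative only; the real system uses GPUDirect/RDMA straight from the NIC rather than host-side copies): pinned host buffers plus two streams let the transfer of one event batch overlap the processing of the previous one.

```cuda
#include <cuda_runtime.h>

__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // stand-in for the trigger algorithm
}

int main() {
    const int batch = 1 << 16, nbatches = 8;
    float *h_in, *h_out, *d_in, *d_out;
    // Pinned host memory enables true asynchronous DMA transfers.
    cudaHostAlloc((void**)&h_in,  nbatches * batch * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_out, nbatches * batch * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d_in,  2 * batch * sizeof(float));
    cudaMalloc((void**)&d_out, 2 * batch * sizeof(float));
    for (int i = 0; i < nbatches * batch; ++i) h_in[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);

    for (int b = 0; b < nbatches; ++b) {
        int k = b % 2;  // double buffering: copy batch b while batch b-1 computes
        cudaMemcpyAsync(d_in + k * batch, h_in + b * batch,
                        batch * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        process<<<(batch + 255) / 256, 256, 0, s[k]>>>(d_in + k * batch,
                                                       d_out + k * batch, batch);
        cudaMemcpyAsync(h_out + b * batch, d_out + k * batch,
                        batch * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h_in); cudaFreeHost(h_out); cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```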

NaNet-10
ALTERA Stratix V dev board (TERASIC DE5-Net): PCIe x8 Gen3 (8 GB/s), 4 SFP+ ports (link speed up to 10 Gb/s). GPUDirect/RDMA and UDP offload support; 4×10 Gb/s links; stream processing on the FPGA (merging, decompression, …). Working on 40 GbE (100 GbE foreseen).

NA62 GPU trigger system
2024 TDC channels on 4 TEL62 boards; 8×1 Gb/s links for data readout, 4×1 Gb/s for the standard trigger primitives, 4×1 Gb/s for the GPU trigger via NaNet-10. Readout event: 1.5 kb (1.5 Gb/s); GPU-reduced event: 300 b (3 Gb/s). GPU: NVIDIA K20, 2688 cores, 3.9 Teraflops, 6 GB VRAM, PCIe Gen3, memory bandwidth 250 GB/s. Requirements: event rate 10 MHz, L0 trigger rate 1 MHz, max latency 1 ms, total buffering 8 GB per board, max output bandwidth 4 Gb/s per board.

Ring fitting
- Trackless: no information from the tracker; difficult to merge information from many detectors at L0.
- Fast: non-iterative procedure; event rates at the level of tens of MHz.
- Low latency: online (synchronous) trigger.
- Accurate: offline resolution required.
Multi-ring algorithms on the market: with seeds: likelihood, constrained Hough, …; trackless: fiTQun, APFit, possibilistic clustering, Metropolis-Hastings, Hough transform, …

Almagest
New algorithm (Almagest) based on Ptolemy's theorem: “A quadrilateral is cyclic (its vertices lie on a circle) if and only if AD·BC + AB·DC = AC·BD.” Design a procedure suitable for parallel implementation.

Almagest
i) Select a triplet (3 starting points). ii) Loop on the remaining points: if the next point does not satisfy Ptolemy's condition, reject it. iii) If the point satisfies Ptolemy's condition, consider it for the fit. iv) …and so on… v) Perform a single-ring fit. vi) Repeat, excluding the points already used.
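A minimal sketch of the Ptolemy test at the core of this loop (my illustration, not the collaboration's code): the tolerance `eps` is a hypothetical parameter absorbing the detector resolution, the list of seed triplets is assumed precomputed, and the convex vertex ordering A, B, C, D is assumed; a production version would try the possible orderings.

```cuda
#include <cmath>

struct Hit { float x, y; };

__host__ __device__ float dist(Hit p, Hit q) {
    return sqrtf((p.x - q.x) * (p.x - q.x) + (p.y - q.y) * (p.y - q.y));
}

// Ptolemy's condition for the ordered quadrilateral A, B, C, D:
// the four points are concyclic iff AD*BC + AB*DC == AC*BD.
__host__ __device__ bool onSameCircle(Hit A, Hit B, Hit C, Hit D, float eps) {
    float lhs = dist(A, D) * dist(B, C) + dist(A, B) * dist(D, C);
    float rhs = dist(A, C) * dist(B, D);
    return fabsf(lhs - rhs) < eps * rhs;   // relative tolerance
}

// One thread per seed triplet: each thread tests every remaining hit
// against its own triplet, flagging candidates for the single-ring fit.
__global__ void almagestSelect(const Hit* hits, int nHits,
                               const int3* triplets, int nTriplets,
                               unsigned char* accept, /* nTriplets x nHits */
                               float eps) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nTriplets) return;
    Hit A = hits[triplets[t].x], B = hits[triplets[t].y], C = hits[triplets[t].z];
    for (int i = 0; i < nHits; ++i)
        accept[t * nHits + i] = onSameCircle(A, B, C, hits[i], eps);
}
```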

Histogram
The XY plane is divided into a grid, and a histogram of the distances from the hits is created for each point of the grid.
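A sketch of this histogram approach (my reconstruction of the idea; the grid pitch and binning parameters are made up): one thread block per grid point accumulates a shared-memory histogram of hit distances, and a sharp peak identifies a ring of that radius centred near the grid point.

```cuda
// One block per grid point: histogram the hit distances from that point.
#define NBINS 64
#define RMAX  20.0f   // hypothetical maximum ring radius

__global__ void distanceHistograms(const float2* hits, int nHits,
                                   const float2* gridPts, int nGrid,
                                   unsigned int* histos /* nGrid x NBINS */) {
    int g = blockIdx.x;
    if (g >= nGrid) return;

    __shared__ unsigned int h[NBINS];
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x) h[b] = 0;
    __syncthreads();

    float2 c = gridPts[g];
    for (int i = threadIdx.x; i < nHits; i += blockDim.x) {
        float dx = hits[i].x - c.x, dy = hits[i].y - c.y;
        float r = sqrtf(dx * dx + dy * dy);
        int bin = (int)(r / RMAX * NBINS);
        if (bin < NBINS) atomicAdd(&h[bin], 1u);  // shared-memory histogram
    }
    __syncthreads();

    // A peak in h[] means many hits equidistant from this grid point,
    // i.e. a ring centred near it with radius at the peak bin.
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        histos[g * NBINS + b] = h[b];
}
```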

Results in the 2016 NA62 run
Testbed: Supermicro X9DRG-QF, Intel C602 Patsburg, Intel Xeon E5-2602 2.0 GHz, 32 GB DDR3, NVIDIA K20c. ~25% of target beam intensity (9×10^11 pps); gathering time 350 ms. Processing time per event: 1 µs (to be improved). Processing latency: below 200 µs, compatible with the NA62 requirements.

Conclusions
Concerning real-time processing, in the past HEP used to pose the questions and find the answers itself (home-made solutions). Nowadays HEP can still propose very important challenges, but we have to look to the market for solutions. An effective approach is a clever integration of commodity heterogeneous computing devices ((low-power) CPUs, GPUs, FPGAs, …), orchestrating their capabilities at their best. Heterogeneous computing could be a good complement to other solutions, with the advantage of exploiting “for free” the development done for other purposes.

R. Ammendola(a), A. Biagioni(b), P. Cretaro(b), S. Di Lorenzo(c), O. Frezza(b), G. Lamanna(d), F. Lo Cicero(b), A. Lonardo(b), M. Martinelli(b), P. S. Paolucci(b), E. Pastorelli(b), R. Piandani(f), L. Pontisso(d), D. Rossetti(e), F. Simula(b), M. Sozzi(c), P. Valente(b), P. Vicini(b)
(a) INFN Sezione di Roma Tor Vergata (b) INFN Sezione di Roma (c) INFN Sezione di Pisa (d) INFN LNF (e) nVIDIA Corporation, USA