“GPU for realtime processing in HEP experiments” Piero Vicini (INFN), on behalf of the GAP collaboration. Beijing, 17.5.2013.

Presentation transcript:

GAP Realtime
GAP (GPU Application Project) for Realtime in HEP and medical imaging is a 3-year project funded by the Italian Ministry of Research, started at the beginning of April. It involves three groups (~20 people): INFN Pisa (G. Lamanna), Ferrara (M. Fiorini) and Roma (A. Messina), with the participation of the APEnet Roma group. Several positions will be opened to work on GAP; contact us for further information.

GAP Realtime
“Realization of an innovative system for complex calculations and pattern recognition in real time by using commercial graphics processors (GPU). Application in High Energy Physics experiments to select rare events and in medical imaging for CT, PET and NMR.”
As far as HEP is concerned, we will study GPU applications in low-level hardware triggers with reduced latency and in high-level software triggers. We will consider the NA62 L0 trigger and the ATLAS high-level muon trigger as “physics cases” for our studies.

NA62: Overview
Main goal: measurement of the branching ratio of the ultra-rare decay K+ → π+νν̄ (BR_SM = (8.5 ± 0.7)·10⁻¹¹). Stringent test of the SM, golden mode for the search and characterization of New Physics.
Novel technique: kaon decays in flight, O(100) events in 2 years of data taking.
Huge background: hermetic veto system, efficient PID.
Weak signal signature: high-resolution measurement of the kaon and pion momenta.
Ultra-rare decay: high-intensity beam, efficient and selective trigger system.

The NA62 TDAQ system
(Diagram: the sub-detectors (RICH, MUV, CEDAR, LKr, STRAWS, LAV) send trigger primitives to the L0 trigger processor (L0TP); accepted data flow through a GigaEth switch to the L1/L2 PC farm (event builder) and on to CDR at O(kHz).)
L0: hardware, synchronous level. 10 MHz to 1 MHz. Maximum latency 1 ms.
L1: software level, “single detector”. 1 MHz to 100 kHz.
L2: software level, “complete information”. 100 kHz to a few kHz.

Two problems in using GPUs at L0
Computing power: is the GPU fast enough to take trigger decisions at event rates of tens of MHz?
Latency: is the GPU latency per event small enough to fit the tight latency budget of a low-level trigger system? Is the latency stable enough for use in synchronous trigger systems?

GPU computing power

GPU processing
(Diagram: readout board → GbE → NIC → system memory → GPU memory over PCIe.)
20 events of data (1404 bytes) sent from the readout board to the GbE NIC are stored in a receiving kernel buffer on the host. T = 0: start of the send operation on the readout board.

GPU processing
Data are copied from the kernel buffer to a user-space buffer (99 μs).

GPU processing
Data are copied from system memory to GPU memory (104 μs).

GPU processing
The ring pattern-matching GPU kernel is executed and the results are stored in device memory (134 μs).

GPU processing
The results are copied from GPU memory back to system memory (322 bytes, 20 results) (139 μs).
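To make the steps above concrete, here is a minimal host-side CUDA sketch of the same pipeline (the kernel name ring_match, the buffer sizes and the dummy kernel body are placeholders, not the actual GAP/NA62 code): it copies one packet of events to the GPU, runs a pattern-matching kernel, copies the results back, and times each stage with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: the real ring pattern matching would go here.
__global__ void ring_match(const unsigned char *events, int n_events, float *results) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_events) results[i] = 0.0f;
}

int main() {
    const int    n_events = 20;    // one GbE packet worth of events, as in the slides
    const size_t in_sz    = 1404;  // bytes of event data per packet
    const size_t out_sz   = n_events * sizeof(float);

    unsigned char *h_in, *d_in; float *h_out, *d_out;
    cudaMallocHost((void **)&h_in,  in_sz);   // page-locked host buffers
    cudaMallocHost((void **)&h_out, out_sz);
    cudaMalloc((void **)&d_in,  in_sz);
    cudaMalloc((void **)&d_out, out_sz);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d_in, h_in, in_sz, cudaMemcpyHostToDevice);     // system memory -> GPU memory
    cudaEventRecord(t1);
    ring_match<<<1, 32>>>(d_in, n_events, d_out);              // pattern-matching kernel
    cudaEventRecord(t2);
    cudaMemcpy(h_out, d_out, out_sz, cudaMemcpyDeviceToHost);  // results back to system memory
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float h2d, kern, d2h;
    cudaEventElapsedTime(&h2d,  t0, t1);
    cudaEventElapsedTime(&kern, t1, t2);
    cudaEventElapsedTime(&d2h,  t2, t3);
    printf("H2D %.3f ms, kernel %.3f ms, D2H %.3f ms\n", h2d, kern, d2h);
    return 0;
}
```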

GPU processing: open problems (1)
The communication latency dominates: lat_comm ≅ 110 μs ≈ 4 × lat_proc.

GPU processing: open problems (2)
Fluctuations in the GbE component of lat_comm may spoil the real-time requirement, even at low event counts.
(sockperf output: latency summary with total observations and percentile breakdown.)

Two approaches: PF_RING driver
Fast packet capture from a standard NIC using the PF_RING driver (its author is a member of the GAP collaboration). The data are written directly into user-space memory, skipping the redundant copy into kernel memory space. Works both at 1 Gb/s and at 10 Gb/s. Latency fluctuations could be further reduced using an RTOS (under study).
(Diagram: NIC → chipset → CPU/RAM, PCI Express → GPU/VRAM.)
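For illustration only, a minimal host-side capture loop with the public PF_RING C API might look like the sketch below. This is not the GAP code; the interface name and snapshot length are placeholders, and the API calls are reproduced from memory, so the PF_RING documentation should be checked before relying on them.

```c
// Host-side sketch: capture packets with PF_RING, bypassing the extra
// copy through the kernel socket buffers.
#include <stdio.h>
#include <pfring.h>

int main() {
    /* "eth1" and the 1536-byte snapshot length are placeholders. */
    pfring *ring = pfring_open("eth1", 1536, PF_RING_PROMISC);
    if (ring == NULL) { perror("pfring_open"); return 1; }
    pfring_enable_ring(ring);

    struct pfring_pkthdr hdr;
    u_char *pkt = NULL;
    for (;;) {
        /* With buffer_len = 0, PF_RING hands back a pointer into its own
           user-space mapped memory, so the payload can be consumed
           without an extra copy. */
        if (pfring_recv(ring, &pkt, 0, &hdr, 1 /* wait for a packet */) > 0) {
            /* ... strip the Ethernet/IP/UDP headers here and append the event
               payload to the buffer that will be copied to the GPU ... */
            printf("captured %u bytes\n", hdr.caplen);
        }
    }
    pfring_close(ring);  /* not reached in this endless-loop sketch */
    return 0;
}
```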

Two approaches: NaNet (1)
NaNet is derived from the APEnet+ design: a 3D torus network for HPC hybrid parallel computing platforms, with GPUDirect RDMA support at the hardware level. It was the first non-NVIDIA device with a P2P connection to a GPU (joint development with NVIDIA). High-end FPGA-based PCIe Gen2 x8 card, 6 fully bidirectional torus links.

Two approaches: NaNet (2)
Multiple link technologies supported: apelink (bidirectional, 34 Gbps), 1 GbE, and future interconnection standards.
New features:
- UDP offload: extract the payload from UDP packets.
- NaNet controller: encapsulate the UDP payload in a newly forged APEnet+ packet and send it to the RX NI logic.
- Low-latency receiving GPU buffer management.
Implemented both on the APEnet+ card and on an ALTERA dev kit.
A. Lonardo, “Building a Low-latency, Real-time, GPU-based Stream Processing System”, GTC2013 conference.

NaNet preliminary results
In the 1 GbE L0 GPU-based Trigger Processor prototype, the sweet spot between latency and throughput lies at an event data buffer size in the kB range. Sustained bandwidth ~119.7 MB/s; NaNet transit time 7.3 μs ÷ 8.6 μs.

First application: the RICH
~17 m long vessel filled with Neon at 1 atm. Light is focused by two mirrors onto two spots, each equipped with ~1000 PMs (18 mm pixels). 3σ separation in the GeV/c momentum range of interest, ~18 hits per ring on average. ~100 ps time resolution, ~10 MHz event rate. It provides the time reference for the trigger.
(Diagram: vessel diameter 4 → 3.4 m, beam pipe, mirror mosaic with 17 m focal length, volume ~200 m³, 2 × ~1000 PMs.)

Algorithms for single-ring search
DOMH/POMH: each of the 1000 PMs is considered as the center of a circle. For each center, a histogram of the distances between the center and the hits is built.
HOUGH: each hit is taken as the center of a test circle with a given radius. The ring center is the best matching point of the test circles: a voting procedure in a 3D parameter space.
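A possible CUDA sketch of the DOMH/POMH idea (kernel name, bin count and radius range are assumptions, not the GAP implementation): one block per candidate center, i.e. per PM, with the block's threads filling a shared-memory histogram of hit-to-center distances; the most populated bin gives the radius hypothesis for that center.

```cuda
#include <cuda_runtime.h>

#define N_BINS 128          // histogram bins in distance (assumed granularity)
#define MAX_RADIUS 300.0f   // assumed upper bound for the ring radius (mm)

// One block per candidate center (i.e. per PM); threads stride over the hits.
__global__ void domh_kernel(const float2 *pm_pos, int n_pm,
                            const float2 *hits, int n_hits,
                            int *best_bin)  // output: most populated distance bin per center
{
    __shared__ int histo[N_BINS];
    for (int i = threadIdx.x; i < N_BINS; i += blockDim.x) histo[i] = 0;
    __syncthreads();

    float2 c = pm_pos[blockIdx.x];
    for (int h = threadIdx.x; h < n_hits; h += blockDim.x) {
        float dx = hits[h].x - c.x, dy = hits[h].y - c.y;
        float d = sqrtf(dx * dx + dy * dy);
        int bin = (int)(d / MAX_RADIUS * N_BINS);
        if (bin < N_BINS) atomicAdd(&histo[bin], 1);
    }
    __syncthreads();

    // Thread 0 scans the histogram; a parallel reduction would be used in practice.
    if (threadIdx.x == 0) {
        int best = 0;
        for (int b = 1; b < N_BINS; ++b)
            if (histo[b] > histo[best]) best = b;
        best_bin[blockIdx.x] = best;
    }
}
```

It would be launched with one block per PM, e.g. domh_kernel<<<n_pm, 128>>>(d_pm, n_pm, d_hits, n_hits, d_best).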

Algorithms for single-ring search
TRIPL: in each thread the center of the ring is computed using three points (“triplets”). For the same event, several triplets (but not all possible ones) are examined at the same time. The final center is obtained by averaging the computed center positions.
MATH: translation of the ring to the centroid of the hits. In this system a least-squares method can be used: the circle condition reduces to a linear system, solvable analytically, without any iterative procedure.
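The translation-to-centroid trick can be made explicit with a short sketch (our own formulation of the standard algebraic circle fit, not the GAP implementation): after shifting the hits to their centroid, the normal equations of the least-squares fit reduce to a 2×2 linear system with a closed-form solution.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Single-ring algebraic fit in centroid-translated coordinates.
// Requires n >= 3 and non-collinear hits (so that det != 0).
__host__ __device__
void fit_circle_math(const float *x, const float *y, int n,
                     float *xc, float *yc, float *r)
{
    // Centroid of the hits.
    float mx = 0.f, my = 0.f;
    for (int i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;

    // Sums of the translated coordinates u = x - mx, v = y - my.
    float Suu = 0, Svv = 0, Suv = 0, Suuu = 0, Svvv = 0, Suvv = 0, Svuu = 0;
    for (int i = 0; i < n; ++i) {
        float u = x[i] - mx, v = y[i] - my;
        Suu += u * u;  Svv += v * v;  Suv += u * v;
        Suuu += u * u * u;  Svvv += v * v * v;
        Suvv += u * v * v;  Svuu += v * u * u;
    }

    // 2x2 linear system:  | Suu Suv | |uc|   = 0.5 | Suuu + Suvv |
    //                     | Suv Svv | |vc|         | Svvv + Svuu |
    float det = Suu * Svv - Suv * Suv;
    float bu = 0.5f * (Suuu + Suvv);
    float bv = 0.5f * (Svvv + Svuu);
    float uc = (bu * Svv - bv * Suv) / det;
    float vc = (bv * Suu - bu * Suv) / det;

    *xc = uc + mx;                                  // translate the center back
    *yc = vc + my;
    *r  = sqrtf(uc * uc + vc * vc + (Suu + Svv) / n);
}
```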

Processing time
The MATH algorithm gives 50 ns/event processing time for packets of >1000 events.
The performance of DOMH (the most resource-hungry algorithm) is compared on several video cards: the gain due to the different generations of video cards is clearly visible.

Processing time stability
The stability of the execution time is an important parameter in a synchronous system. The GPU (Tesla C1060, MATH algorithm) shows a quasi-deterministic behavior with very small tails.
During long runs the GPU temperature rises differently on the different chips, but the computing performance is not affected.

Data transfer time
The data transfer time significantly influences the total latency and depends on the number of events to transfer. The transfer time is quite stable (with a double-peak structure in the GPU → CPU transfer).
Using page-locked memory, processing and data transfer can be parallelized (double data-transfer engine on the Tesla C2050).
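A minimal sketch of the overlap described above (the kernel, the packet sizes and the stream count are placeholders): with page-locked host buffers, cudaMemcpyAsync calls issued on different streams let the copy engine(s) move one packet while another packet is being processed.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: the real per-event processing would go here.
__global__ void process_packet(const unsigned char *in, float *out, int n_events) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_events) out[i] = (float)in[i];
}

int main() {
    const int n_streams = 2, n_events = 1000;
    const size_t in_sz = n_events * 64, out_sz = n_events * sizeof(float);  // assumed sizes

    cudaStream_t s[n_streams];
    unsigned char *h_in[n_streams], *d_in[n_streams];
    float *h_out[n_streams], *d_out[n_streams];

    for (int i = 0; i < n_streams; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMallocHost((void **)&h_in[i], in_sz);   // page-locked: required for async copies
        cudaMallocHost((void **)&h_out[i], out_sz);
        cudaMalloc((void **)&d_in[i], in_sz);
        cudaMalloc((void **)&d_out[i], out_sz);
    }

    for (int p = 0; p < 100; ++p) {          // 100 packets, round-robin over the streams
        int i = p % n_streams;
        cudaStreamSynchronize(s[i]);         // previous packet on this stream must be done
        // ... fill h_in[i] with the next packet of events ...
        cudaMemcpyAsync(d_in[i], h_in[i], in_sz, cudaMemcpyHostToDevice, s[i]);
        process_packet<<<n_events / 256 + 1, 256, 0, s[i]>>>(d_in[i], d_out[i], n_events);
        cudaMemcpyAsync(h_out[i], d_out[i], out_sz, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    return 0;
}
```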

TESLA C1060
On the Tesla C1060 the results, both in computing time and in total latency, are very encouraging: about 300 μs for 1000 events, with a throughput of about 300 MB/s.
“Fast online triggering in high-energy physics experiments using GPUs”, Nucl. Instrum. Meth. A662 (2012) 49-54.

TESLA C2050 (Fermi) & GTX680 (Kepler)
On the Tesla C2050 and on the GTX680 the performance improves (by ×4 and ×8 respectively). The data transfer latency improves a lot thanks to streaming and to PCI Express gen3.

Multi-ring search
Most of the background events have 3 tracks, with at most 2 rings per spot. Standard multiple-ring fit methods aren't suitable for us, since we need a fit that is: trackless, non-iterative, high resolution, and fast (~1 μs, for a 1 MHz input rate).
New approach: use Ptolemy's theorem (from the first book of the Almagest): “A quadrilateral is cyclic (the vertices lie on a circle) if and only if the relation AD·BC + AB·DC = AC·BD holds.”

Almagest algorithm description
- Select a triplet (A, B, C) randomly (1 triplet per point = N+M triplets in parallel).
- Consider a fourth point D: if it does not satisfy Ptolemy's theorem, reject it.
- If the point satisfies Ptolemy's theorem, it is considered for a fast algebraic fit (e.g. MATH, Riemann sphere, Taubin, …).
- Each thread converges to a candidate center point. Each candidate is associated with the number Q of quadrilaterals contributing to its definition.
- For the center candidates with Q greater than a threshold, the points at a distance compatible with the radius R are considered for a more precise re-fit.
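The Ptolemy selection step can be written as a small device-side helper (the tolerance value and the point representation are assumptions of this sketch, not the GAP code): four hits are kept as candidates for a common ring when AD·BC + AB·DC = AC·BD holds within a tolerance.

```cuda
#include <cuda_runtime.h>
#include <math.h>

__host__ __device__ inline float dist2d(float2 a, float2 b) {
    return sqrtf((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// Ptolemy's theorem: ABCD is cyclic iff AD*BC + AB*DC = AC*BD, with the four
// points taken in the order in which they appear around the circle (for a
// wrong ordering only the inequality holds).
// 'tol' is a hypothetical relative tolerance accounting for detector resolution.
__host__ __device__ inline bool is_cyclic(float2 A, float2 B, float2 C, float2 D, float tol) {
    float lhs = dist2d(A, D) * dist2d(B, C) + dist2d(A, B) * dist2d(D, C);
    float rhs = dist2d(A, C) * dist2d(B, D);
    return fabsf(lhs - rhs) < tol * rhs;   // relative comparison
}
```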

Almagest algorithm results
The generated positions of the two rings are: ring 1 at (6.65, 6.15); ring 2 at (8.42, 4.59) with R = 12.6.
The fitted positions of the two rings are: ring 1 at (7.29, 6.57); ring 2 at (8.44, 4.34) with R = 12.26.
Fitting time on the Tesla C1060: 1.5 μs/event.

The ATLAS experiment

The ATLAS Trigger System
L1: information from the muon and calorimeter detectors, processed by custom electronics.
L2: Regions of Interest with L1 signal, information from all sub-detectors, dedicated software algorithms (~60 ms).
EF: full event reconstruction with offline software (~1 s).

ATLAS as a study case for a GPU software trigger
The ATLAS trigger system has to cope with the very demanding conditions of the LHC experiments in terms of rate, latency, and event size. The increase in LHC luminosity and in the number of overlapping events poses new challenges to the trigger system, and new solutions have to be developed for the forthcoming upgrades.
GPUs are an appealing solution to be explored for such experiments, especially for the high-level trigger, where the time budget is not marginal and one can profit from the highly parallel GPU architecture.
We intend to study the performance of some of the ATLAS high-level trigger algorithms as implemented on GPUs, in particular those concerning muon identification and reconstruction.

High-level muon triggers
L2 muon identification is based on:
- track reconstruction in the muon spectrometer and in conjunction with the inner detector (<3 ms);
- isolation of the muon track in a given cone, based both on ID tracks and on calorimeter energy (~10 ms).
For both track and energy reconstruction, the algorithm execution time grows at least linearly with pileup, and the algorithms are naturally parallelizable. The algorithm purity also depends on the width of the cone; the reconstruction within the cone is easily parallelizable (see the sketch below).
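As an illustration of why cone-based isolation parallelizes naturally (not ATLAS code; the track representation, cone size and launch configuration are placeholders), summing the transverse momenta of inner-detector tracks inside a ΔR cone around each muon candidate maps directly onto one GPU thread block per muon.

```cuda
#include <cuda_runtime.h>
#include <math.h>

struct Track { float eta, phi, pt; };   // simplified track representation (assumption)

__device__ inline float delta_r(float eta1, float phi1, float eta2, float phi2) {
    const float kPi = 3.14159265f;
    float deta = eta1 - eta2;
    float dphi = fabsf(phi1 - phi2);
    if (dphi > kPi) dphi = 2.0f * kPi - dphi;   // wrap dphi into [0, pi]
    return sqrtf(deta * deta + dphi * dphi);
}

// One block (of 256 threads) per muon candidate; threads stride over the tracks
// and the partial sums are combined with a shared-memory tree reduction.
__global__ void cone_isolation(const Track *muons, int n_muons,
                               const Track *tracks, int n_tracks,
                               float cone, float *sum_pt)
{
    __shared__ float partial[256];
    Track mu = muons[blockIdx.x];
    float local = 0.f;
    for (int t = threadIdx.x; t < n_tracks; t += blockDim.x)
        if (delta_r(mu.eta, mu.phi, tracks[t].eta, tracks[t].phi) < cone)
            local += tracks[t].pt;
    partial[threadIdx.x] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // reduction (blockDim.x = 256)
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) sum_pt[blockIdx.x] = partial[0];
}
```

A launch such as cone_isolation<<<n_muons, 256>>>(d_muons, n_muons, d_tracks, n_tracks, 0.2f, d_sum) processes all muon candidates of an event concurrently.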

Conclusions (1)
The GAP project aims at studying the possibility of using GPUs in realtime applications. In HEP triggers this will be studied in the L0 trigger of NA62 and in the muon HLT of ATLAS.
For the moment we have focused on the online ring reconstruction in the RICH for NA62. The results, on both throughput and latency, are encouraging enough to build a full-scale demonstrator.

Conclusions (2)
The ATLAS high-level trigger offers an interesting opportunity to study the impact of GPUs in the software triggers of the LHC experiments.
Several trigger algorithms are heavily affected by pileup; this effect can be mitigated by algorithms with a parallel structure.
As for the offline reconstruction, the next hardware generation for the HLT will be based on vector and/or parallel processors: GPUs would be a good partner to speed up the online processing in a high-pileup / high-intensity environment.

Realtime?