1
Triggering events with GPGPU in ATLAS
L. Rinaldi (Università di Bologna and INFN)
2
Institutes and people
SMU, Dallas, TX, USA
INESC-ID, Universidade de Lisboa, Portugal
LIP, Lisboa, Portugal
STFC and Rutherford Appleton Lab, UK
Sapienza Università di Roma, Italy
AGH Univ. of Science and Technology, Krakow, Poland
Istituto Nazionale di Fisica Nucleare (INFN-Bologna), Italy
Università di Bologna, Italy
INFN-CNAF, Italy
Italian people: L. Rinaldi (BO), M. Bauce (RM1), A. Messina (RM1), M. Negrini (BO), A. Sidoti (BO), S. Tupputi (CNAF)
Not ATLAS members, but helping a lot: F. Semeria (INFN-BO), M. Belgiovine (UNIBO)
3
The ATLAS trigger and DAQ
- Composed of a hardware-based Level-1 (L1) trigger and a software-based High Level Trigger (HLT)
- Reduces the 40 MHz input event rate to about 1 kHz (~1500 MB/s output)
- L1 identifies interesting activity in small geometrical regions of the detector called Regions of Interest (RoIs)
- RoIs identified by L1 are passed to the HLT for event selection
- With about 25k HLT processes: ~25k / 100 kHz ≈ 250 ms decision time per event
4
High Level Trigger (HLT)
- The HLT uses stepwise processing of sequences called Trigger Chains (TC)
- Each chain is composed of Algorithms and is seeded by RoIs from L1
- The same RoI may seed multiple chains
- The same Algorithm may run on different data
- An algorithm runs only once on the same data (caching; see the sketch below)
- If a RoI fails a selection step, further algorithms in the chain are not executed
- Initial algorithms (L2) work on partial event data (2-6%); later algorithms have access to the full event data
[Diagram: L1 RoIs seed chains of L2 algorithms; after Event Building, Event Filter (EF) algorithms run on the full event data]
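To make the caching and early-rejection logic concrete, here is a minimal C++ sketch. All names, the (algorithm, RoI) cache key and the chain representation are invented for illustration; this is not the actual ATLAS HLT steering code.

```cpp
// Minimal sketch of stepwise chain execution with per-(algorithm, RoI) result
// caching and early rejection. All names are invented placeholders.
#include <map>
#include <string>
#include <utility>
#include <vector>

struct RoI { int id; };

// Stand-in for a real HLT algorithm; always accepts in this sketch.
bool runAlgorithm(const std::string& name, const RoI& roi) {
    (void)name; (void)roi;
    return true;
}

// Cache keyed by (algorithm name, RoI id): an algorithm runs only once on the same data.
using Cache = std::map<std::pair<std::string, int>, bool>;

bool executeChain(const std::vector<std::string>& chain, const RoI& roi, Cache& cache) {
    for (const auto& alg : chain) {
        const auto key = std::make_pair(alg, roi.id);
        auto it = cache.find(key);
        const bool passed = (it != cache.end()) ? it->second                       // reuse cached result
                                                : (cache[key] = runAlgorithm(alg, roi));
        if (!passed) return false;  // RoI failed this step: the rest of the chain is skipped
    }
    return true;  // all steps passed for this RoI
}
```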
5
The ATLAS Detector
Run-1 CPU measurements, combined HLT: 60% Tracking, 20% Calorimeter, 10% Muon, 10% other
ATLAS is developing and expanding a demonstrator exploiting GPUs in:
- Inner Detector (ID) tracking
- Calorimeter topo-clustering
- Muon tracking
to evaluate the potential benefit of GPUs in terms of throughput per unit cost
6
ATLAS GPU Demonstrator
- Increasing LHC instantaneous luminosity is leading to higher pile-up
- CPU time rises rapidly with pile-up due to the combinatorial nature of the tracking
- HLT farm size is limited, mainly by power and cooling
- GPUs provide a good compute-per-watt ratio
7
Offloading Mechanism
- Trigger Processing Units (PUs) integrate the ATLAS offline software framework, Athena, into the online environment
- Many PU processes run on each trigger host
- A client-server approach is implemented to manage resources between multiple PU processes (see the sketch below):
  - The PU prepares the data to be processed and sends it to the server
  - The Accelerator Process Extension (APE) server manages offload requests and executes kernels on the GPU(s)
  - It sends the results back to the process that made the offload request
- The server can support different hardware types (GPUs, Xeon Phi, CPUs) and different configurations, such as GPU/Phi mixtures and in-host or off-host accelerators
[Diagram: Trigger PUs (Athena) send Data+Metadata to the APE Server and receive Results+Metadata back]
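As a rough illustration of the data+metadata exchange between a Trigger PU and the APE server, here is a minimal sketch of the request/reply payloads. The field names and layout are assumptions for illustration, not the actual APE wire format.

```cpp
// Hypothetical request/reply exchanged between a Trigger PU client and the APE server.
#include <cstdint>
#include <vector>

struct OffloadRequest {
    uint32_t moduleId;             // which detector module should handle the work (e.g. ID tracking)
    uint32_t requestId;            // lets the server route the result back to the right PU request
    std::vector<uint8_t> payload;  // GPU-optimized EDM data, already converted by the client
};

struct OffloadReply {
    uint32_t requestId;            // matches the originating request
    std::vector<uint8_t> results;  // results + metadata to be converted back to the Athena EDM
};
```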
8
Offloading - Client
- Implemented for each detector, e.g. TrigInDetAccelSvc for the Inner Detector
- An HLT algorithm asks TrigDetAccelSvc for offload
- TrigDetAccelSvc converts the C++ classes for raw and reconstructed quantities from the Athena Event Data Model (EDM) to a GPU-optimized EDM through Data Export Tools
- It then adds metadata and requests the offload through OffloadSvc
- OffloadSvc manages multiple requests and the communication with the APE server
- Results are converted back to the Athena EDM by TrigDetAccelSvc and handed to the requesting algorithm (see the sketch below)
- Export tools and server communication need to be fast (they are serial sections)
[Diagram: HLT algorithm → TrigDetAccelSvc (Athena EDM ↔ GPU EDM via Export Tools) → OffloadSvc → APE Server, with the result returned along the same path]
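A compressed sketch of the client-side flow follows. The service names TrigDetAccelSvc and OffloadSvc come from the slide, but every type and function below is an invented stand-in (the OffloadSvc stand-in simply echoes its input so the example is self-contained); this is not the real Athena interface.

```cpp
// Sketch of the client-side offload flow: EDM conversion, request, result conversion.
#include <cstdint>
#include <cstring>
#include <vector>

struct AthenaHits   { std::vector<float> positions; };   // stand-in for Athena EDM classes
struct GpuHitBuffer { std::vector<uint8_t> bytes; };      // stand-in for the GPU-optimized EDM

// Data Export Tool: Athena EDM -> GPU EDM (a serial section, so it must be fast).
GpuHitBuffer exportToGpuEDM(const AthenaHits& h) {
    GpuHitBuffer b;
    b.bytes.resize(h.positions.size() * sizeof(float));
    std::memcpy(b.bytes.data(), h.positions.data(), b.bytes.size());
    return b;
}

// OffloadSvc stand-in: ship the buffer to the APE server and wait for the reply.
// Here it just echoes the input so the sketch compiles and links on its own.
GpuHitBuffer sendOffloadRequest(const GpuHitBuffer& data) { return data; }

// Convert the server's result back to the Athena EDM for the requesting algorithm.
AthenaHits importFromGpuEDM(const GpuHitBuffer& r) {
    AthenaHits h;
    h.positions.resize(r.bytes.size() / sizeof(float));
    std::memcpy(h.positions.data(), r.bytes.data(), r.bytes.size());
    return h;
}

AthenaHits offloadTracking(const AthenaHits& input) {
    GpuHitBuffer gpuData = exportToGpuEDM(input);         // Athena EDM -> GPU EDM
    GpuHitBuffer gpuOut  = sendOffloadRequest(gpuData);   // request/result via OffloadSvc + APE server
    return importFromGpuEDM(gpuOut);                      // GPU EDM -> Athena EDM
}
```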
9
Offloading - Server
- The APE Server uses a plug-in mechanism; it is composed of:
  - Manager: handles communication with the processes and the scheduling
    - Receives offload requests
    - Passes each request to the appropriate Module
    - Executes Work items
    - Sends the results back to the requesting process
  - Modules: manage GPU resources and create Work items
    - Manage constants and time-varying data on the GPU
    - Bunch multiple requests together to optimize utilization
    - Create the appropriate Work items
  - Work items: composed of the kernels that run on the GPU and prepare the results
- Each detector implements its own Module (see the sketch below)
[Diagram: Athena sends Data+Metadata to the APE Server (Manager, Modules, To-do and Completed work queues) and receives Results+Metadata back]
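The plug-in idea can be sketched as follows; the Manager/Module/Work interfaces are invented for illustration and omit the work queues, request bunching and scheduling that the real APE server provides.

```cpp
// Reduced sketch of the APE server's plug-in structure: the Manager routes each
// request to a detector Module, which creates a Work item whose run() method
// launches the GPU kernels and prepares the result.
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Work {                                   // a unit of GPU work
    virtual std::vector<uint8_t> run() = 0;     // launches kernels, returns serialized results
    virtual ~Work() = default;
};

struct Module {                                 // one per detector (ID, Calo, Muon, ...)
    virtual std::unique_ptr<Work> makeWork(std::vector<uint8_t> payload) = 0;
    virtual ~Module() = default;
};

class Manager {
public:
    void addModule(uint32_t id, std::unique_ptr<Module> m) { modules_[id] = std::move(m); }

    // Handle one offload request: route to the right Module, run the Work item,
    // and hand the result back to the requesting trigger process.
    std::vector<uint8_t> handle(uint32_t moduleId, std::vector<uint8_t> payload) {
        auto work = modules_.at(moduleId)->makeWork(std::move(payload));
        return work->run();
    }

private:
    std::map<uint32_t, std::unique_ptr<Module>> modules_;
};
```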
10
ID tracking Module
- Tracking is the most time-consuming part of the trigger; it is done in four steps:
  1. ByteStream decoding
  2. Hit clustering
  3. Track formation
  4. Clone removal
- ByteStream decoding converts the detector-encoded output into hit coordinates
  - Each thread works on one data word and decodes it; the data contains hits on different ID layers
- Charged particles typically activate one or more neighbouring strips or pixels; hit clustering merges these with a Cellular Automaton algorithm (see the sketch below)
  - Each hit starts as an independent cluster; adjacent clusters are merged at each step until all adjacent cells belong to the same cluster
  - Each thread works on a different hit
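As an illustration of the cellular-automaton clustering step, here is a simplified CUDA kernel. The CSR adjacency arrays, label buffers and convergence flag are assumptions made for the sketch; this is not the demonstrator's actual code.

```cuda
// One iteration of cellular-automaton hit clustering: each thread owns one hit
// and adopts the smallest cluster label among its adjacent hits. The adjacency
// list (CSR arrays nbrStart/nbrList) is assumed to be built during decoding.
__global__ void caMergeStep(const int* nbrStart, const int* nbrList,
                            const int* labelIn, int* labelOut,
                            int nHits, int* changed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nHits) return;

    int best = labelIn[i];
    for (int k = nbrStart[i]; k < nbrStart[i + 1]; ++k) {
        int j = nbrList[k];               // an adjacent strip/pixel hit
        best = min(best, labelIn[j]);     // adopt the smaller cluster label
    }
    labelOut[i] = best;
    if (best != labelIn[i]) atomicExch(changed, 1);  // another iteration is needed
}

// Host side (sketch): iterate until no label changes, swapping the in/out buffers.
// for (;;) {
//     cudaMemset(d_changed, 0, sizeof(int));
//     caMergeStep<<<blocks, threads>>>(d_nbrStart, d_nbrList, d_in, d_out, nHits, d_changed);
//     cudaMemcpy(&h_changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
//     std::swap(d_in, d_out);
//     if (!h_changed) break;
// }
```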
11
ID tracking Module
- Track finding starts with pair forming in the inner layers: a 2D thread array checks the pairing condition and selects suitable pairs (see the sketch below)
- These pairs are then extrapolated to the outer layers by a 2D thread block to form triplets
- Finally, triplets are combined to form track candidates
- In the clone-removal step, track candidates starting from different pairs but having the same outer-layer hits are merged to form tracks
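A simplified CUDA sketch of the pair-forming step is shown below. The hit layout and the z-extrapolation cut (zCut) are invented placeholders for the real pairing condition used in the trigger.

```cuda
// A 2D thread grid pairs every inner-layer hit with every next-layer hit and
// keeps the pairs that satisfy a simple compatibility cut (placeholder cut:
// the doublet must extrapolate back close to the beamline).
struct Hit { float x, y, z, r; };   // simplified hit: cylindrical radius r and coordinates

__global__ void formPairs(const Hit* layer0, int n0,
                          const Hit* layer1, int n1,
                          int2* pairs, int* nPairs, int maxPairs, float zCut)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // index into the inner layer
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // index into the next layer
    if (i >= n0 || j >= n1) return;

    const Hit a = layer0[i];
    const Hit b = layer1[j];
    if (b.r <= a.r) return;                          // require an outward-going doublet

    // Extrapolate the doublet to r = 0 and require it to point back near the beamline.
    float z0 = a.z - a.r * (b.z - a.z) / (b.r - a.r);
    if (fabsf(z0) < zCut) {
        int slot = atomicAdd(nPairs, 1);             // reserve an output slot
        if (slot < maxPairs) pairs[slot] = make_int2(i, j);
    }
}
```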
12
ID tracking speedup
- In a previous demonstrator, only the first two steps were implemented
- Raw detector data decoding and clusterization algorithms form the first step of reconstruction (data preparation)
- Initial tests with Monte Carlo samples showed up to a 21x speed-up compared to the serial Athena job
- ~12x speed-up with partial reconstruction done on the GPU
- Offload is working!
13
Calorimeter Topo-clustering
- The topo-clustering algorithm classifies calorimeter cells by their signal-to-noise ratio (S/N):
  - S/N > 4: Seed cells
  - 4 > S/N > 2: Growing cells
  - 2 > S/N > 0: Terminal cells
- Standard algorithm: Growing cells around Seeds are included until there are no more Growing or Terminal cells around them
- Parallel algorithm: uses a Cellular Automaton to parallelize the task (see the sketch below)
  - Assign one thread to each cell
  - Start by adding the Seeds to clusters; at each iteration a cell joins the cluster containing its highest-S/N neighbouring cell
  - Terminate when the maximum radius is reached or no more cells can be included
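The parallel cellular-automaton step might look roughly like the following CUDA sketch. The adjacency arrays, thresholds and cluster bookkeeping are simplified assumptions, not the demonstrator code.

```cuda
// One cellular-automaton iteration of parallel topo-clustering: each thread
// owns one calorimeter cell and, if the cell is not yet in a cluster, it joins
// the cluster of its highest-S/N neighbour that already belongs to one. Only
// neighbours above the Growing threshold (S/N > 2) may propagate a cluster.
__global__ void topoClusterStep(const int* nbrStart, const int* nbrList,
                                const float* snr, const int* clusterIn,
                                int* clusterOut, int nCells, int* changed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nCells) return;

    int myCluster = clusterIn[i];
    if (myCluster < 0 && snr[i] > 0.f) {             // unassigned Growing or Terminal cell
        float bestSnr = -1.f;
        for (int k = nbrStart[i]; k < nbrStart[i + 1]; ++k) {
            int j = nbrList[k];
            if (clusterIn[j] >= 0 && snr[j] > 2.f && snr[j] > bestSnr) {
                bestSnr   = snr[j];                  // highest-S/N neighbour already in a cluster
                myCluster = clusterIn[j];
            }
        }
        if (myCluster >= 0) atomicExch(changed, 1);
    }
    clusterOut[i] = myCluster;
    // Seeds (S/N > 4) are assumed to have been given their own cluster index
    // before the first iteration; the host loops until 'changed' stays 0.
}
```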
14
Muons
- Several steps of reconstruction relying on many tools in Athena:
  - Find segments in the Muon Spectrometer (MS), using a Hough Transform
  - Build tracks in the MS
  - Extrapolate to the primary vertex, obtaining a Stand-Alone muon
  - Combine MS muons with ID tracks to obtain a Combined muon
15
Previous HT-GPU study
- Circle detection using an algorithm based on the Hough Transform, running on GPGPU
- Simulation of a typical central tracking detector:
  - Multi-layer cylindrical shape, with its axis on the beamline and centred on the nominal interaction point
  - Uniform magnetic field with field lines parallel to the beamline
  - Charged particles follow helical trajectories (circles in the transverse plane with respect to the detector axis)
- Charged particles hit the detector; the hit positions are used to reconstruct the trajectories
- (This study, done by the Bologna group, is NOT related to the ATLAS GPU demonstrator)
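For orientation, one standard way to set up such a circle Hough transform (the slides do not spell out the exact parametrization used in the study, so the relation below is an illustrative assumption): a circle in the transverse plane passing through the origin, i.e. the nominal interaction point, with centre (A, B) satisfies

    x^2 + y^2 = 2 A x + 2 B y

so each hit (x, y) constrains the parameters to the line B = (x^2 + y^2 - 2 A x) / (2 y) in (A, B) space. The Hough transform accumulates one vote per hit along this line, and maxima of the accumulator correspond to candidate circles, i.e. candidate tracks.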
16
HT execution time vs CPU
[Plots: HT execution time, GPUs vs CPU and GPUs only, for the C++, CUDA and OpenCL implementations, plus the GPU/CPU speed-up]
- GPU vs CPU speed-up of over 15x
- CPU timing scales with the number of hits, while GPU timing is almost independent of the number of hits
- This was a stand-alone simulation; the experimental framework interface and the data format conversion still need to be taken into account
17
Summary and conclusions
- The increasing instantaneous luminosity of the LHC necessitates parallel processing
- Massive parallelization of trigger algorithms on GPUs is being investigated as a way to increase the compute power of the HLT farm
- ATLAS developed the Accelerator Process Extension (APE) framework to manage offloading from trigger processes
- Inner Detector trigger tracking algorithms have been successfully offloaded to the GPU, resulting in a ~12x speed-up compared to the CPU implementation (21x speed-up for data preparation)
- Further algorithms for the ID, Calorimeter and Muon systems are being implemented (a larger speed-up is expected)
- Studies of Hough Transform algorithm execution on GPUs promise a good speed-up
18
Backup
20
CUDA 6.0 coding: Hough Matrix filling (Vote)
- 1D grid over hits (grid dimension = number of hits)
- At a given (φ, θ), a thread block runs over A; for each A, the corresponding B is evaluated
- The MH(φ, θ, A, B) Hough-matrix element is incremented by one with CUDA atomicAdd()
- The matrix is allocated once, at the first iteration, with cudaMallocHost (pinned memory) and initialized on the device with cudaMemset
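The slide describes a four-dimensional accumulator MH(φ, θ, A, B). The sketch below keeps only the (A, B) part of the voting step, using the circle-through-origin relation quoted earlier; the bin ranges, array layout and the small-y guard are placeholders, not the study's actual code.

```cuda
// Illustrative vote kernel: one block per hit (1D grid over hits); the threads
// of the block loop over the A bins, and for each A the matching B bin is
// computed and the accumulator is incremented with atomicAdd.
__global__ void houghVote(const float* hx, const float* hy, int nHits,
                          unsigned int* acc, int nA, int nB,
                          float aMin, float aMax, float bMin, float bMax)
{
    int hit = blockIdx.x;                 // grid dimension = number of hits
    if (hit >= nHits) return;
    float x = hx[hit], y = hy[hit];
    if (fabsf(y) < 1e-6f) return;         // the relation below needs y != 0

    for (int ia = threadIdx.x; ia < nA; ia += blockDim.x) {
        float A = aMin + (aMax - aMin) * (ia + 0.5f) / nA;
        float B = (x * x + y * y - 2.f * A * x) / (2.f * y);   // B follows from A for this hit
        int ib = (int)((B - bMin) / (bMax - bMin) * nB);
        if (ib >= 0 && ib < nB)
            atomicAdd(&acc[ia * nB + ib], 1u);                 // one vote in the (A, B) accumulator
    }
}

// Host side (sketch): the accumulator is zeroed before voting, e.g.
//   cudaMemset(d_acc, 0, nA * nB * sizeof(unsigned int));
```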
21
CUDA 6.0 coding: Local Maxima search
- 2D grid over (φ, θ); grid dimensions: Nφ × (Nθ·NA·NB / maxThreadsPerBlock)
- 2D thread block, with dimXBlock = NA and dimYBlock = maxThreadsPerBlock / NA
- Each thread compares the MH(φ, θ, A, B) element with its neighbours; the larger value is stored in GPU shared memory and eventually transferred back
- I/O demanding: several kernels may access the matrix at the same time
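Correspondingly, a reduced local-maximum search over the same 2D (A, B) accumulator could look like the sketch below; the real study scans the full 4D MH(φ, θ, A, B) matrix and stages comparisons through shared memory, which is omitted here, and the vote threshold is an invented parameter.

```cuda
// Illustrative local-maximum search: each thread checks whether its accumulator
// element is above a vote threshold and not smaller than any of its 8
// neighbours, and if so flags it as a track candidate.
__global__ void findLocalMaxima(const unsigned int* acc, int nA, int nB,
                                unsigned int threshold, int* isMax)
{
    int ia = blockIdx.x * blockDim.x + threadIdx.x;
    int ib = blockIdx.y * blockDim.y + threadIdx.y;
    if (ia >= nA || ib >= nB) return;

    unsigned int v = acc[ia * nB + ib];
    int flag = (v >= threshold);
    for (int da = -1; da <= 1 && flag; ++da)
        for (int db = -1; db <= 1 && flag; ++db) {
            if (da == 0 && db == 0) continue;
            int ja = ia + da, jb = ib + db;
            if (ja >= 0 && ja < nA && jb >= 0 && jb < nB && acc[ja * nB + jb] > v)
                flag = 0;   // a neighbour holds more votes: not a local maximum
        }
    isMax[ia * nB + ib] = flag;
}
```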