“GPU in HEP: online high quality trigger processing”


“GPU in HEP: online high quality trigger processing” ISOTDAQ 7.2.2017 Amsterdam Gianluca Lamanna (INFN)

The World in 2035

The problem in 2035: FCC (Future Circular Collider) is only an example; for fixed target experiments, flavour factories, … the physics reach will be defined by the trigger! What will the triggers look like in 2035?

The trigger in 2035… will be similar to the current trigger: high reduction factor, high efficiency for interesting events, fast decision, high resolution. …but it will also be different: the higher background and pile-up will limit the ability to trigger on interesting events, and the primitives will be more complicated than today: tracks, clusters, rings.

The trigger in 2035… Higher energy: resolution for high-pT leptons → high-precision primitives; high occupancy in the forward region → better granularity. Higher luminosity: track-calo correlation; bunch crossing identification becomes challenging with pile-up. All of these effects go in the same direction: more resolution and more granularity mean more data and more processing.

Classic trigger in the future? Is a traditional “pipelined” trigger possible? Yes and no. Cost and dimension: getting all data in one place requires new links → data flow. No “slow” detectors can participate in the trigger (limited latency). Pre-processing on-detector could help, but FPGAs are not suitable for complicated processing, and software needs commodity hardware. Main limitation: generating high quality trigger primitives on the detector (processing). [R.Ferrari’s talk]

Classic Pipeline: Processing. The performance of an FPGA as a computing device depends on the problem. The increase in computing capability of “standard” FPGAs is not as fast as for CPUs. This scenario could change in the future with the introduction of new FPGA+CPU hybrid devices.

Triggerless? Is it possible to bring all data to PCs? LHCb: yes in 2020 (40 MHz readout, 30 Tb/s data network, 4000 cores, 8800 links); no in 2035 (track+calo = 2 PB/s, plus 5 PB/s for event building; for comparison, the largest Google data center handles 0.1 PB/s). CMS & ATLAS: probably no in 2035 (4 PB/s of data, 4M links, ×10 in switch performance, ×2000 in computing). Main limitation: data transport.

Triggerless: Data Links. Link bandwidth is steadily increasing, but the power consumption is not compatible with HEP requirements (rad-hard serializers): e.g. the lpGBT takes 500 mW per 5 Gb/s, so 4M links → 2 MW just for the on-detector links. The mainstream market is currently not interested in this application. [P. Durante’s Talk]

Example: an alternative approach (diagram: detector → toroidal network → buffers → merging → disk). Triggerless: focus on data links. High latency trigger: heterogeneous computing nodes, toroidal network, time-multiplexed trigger, trigger implemented in software, large buffers; focus on on-detector buffers. Classic pipeline: focus on on-detector processing.

High Latency: Memories. Memory prices decrease very fast compared to CPUs; Moore’s law for memories predicts a ~10% increase per year.

GPU: Graphics Processing Units

Parallel programming. Parallel computing is no longer something only for supercomputers: all processors nowadays are multicore. The use of parallel architectures is mainly due to the physical constraints on frequency scaling.

Parallel computing. Several problems can be split into smaller problems to be solved concurrently. In any case the maximum speed-up is not linear: it is limited by the serial part of the code (Amdahl’s law). The situation improves if the parallelizable fraction grows with the resources (Gustafson’s law). [E.Ozcan talk]
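In formulas (the standard statements of the two laws, not spelled out on the slide), with p the parallelizable fraction of the code and N the number of processors:

    S_{\text{Amdahl}}(N) = \frac{1}{(1-p) + p/N} \;\xrightarrow{\;N\to\infty\;}\; \frac{1}{1-p},
    \qquad
    S_{\text{Gustafson}}(N) = (1-p) + pN

Amdahl’s speed-up saturates at 1/(1-p) no matter how many processors are added, while Gustafson’s keeps growing because the problem size grows with the resources.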

Parallel programming on GPUs. GPUs are processors dedicated to parallel processing for graphics applications: rendering, image transformation, ray tracing, etc. are typical applications where parallelization helps a lot.

What are GPUs? The technical definition of a GPU is “a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second”. The possibility of using the GPU for generic computing (GPGPU) was introduced by NVIDIA in 2007 (CUDA). In 2008 came OpenCL: a consortium of different firms introducing a multi-platform language for many-core computing.

Why GPUs? The GPU is a way to cheat Moore’s law: a SIMD/SIMT parallel architecture. The PC no longer gets faster, just wider. Very high computing power for “vectorizable” problems, with an impressive derivative: almost a factor of 2 in each generation, and continuous development. It is easy to have a desktop PC with teraflops of computing power and thousands of cores. Several applications in HPC, simulation, scientific computing…

Computing power

Why?: CPU vs GPU

CPU: multilevel and large caches to hide long-latency memory accesses; branch prediction to reduce latency in branching; powerful ALU; memory management; large control part. The CPU is a latency-oriented design.

GPU: SIMT (Single Instruction Multiple Thread) architecture; SMX (streaming multiprocessors) execute the kernels; thread-level parallelism; limited caching; limited control: no branch prediction, but branch predication. The GPU is a throughput-oriented design.

CPU + GPU = Heterogeneous Computing. The winning application uses both CPU and GPU: CPUs for the sequential parts (a CPU can be 10× faster than a GPU for sequential code), GPUs for the parallel parts, where throughput wins (a GPU can be 100× faster than a CPU for parallel code).

Processing structure with CUDA. What is CUDA? It is a set of C/C++ extensions enabling GPGPU computing on NVIDIA GPUs. Dedicated APIs allow control of almost all the functions of the graphics processor. Processing follows three steps: 1) copy data from host to device; 2) load the kernel and execute it; 3) copy back the results, as in the sketch below.
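A minimal sketch of the three-step flow (a hypothetical vector add; names and sizes are illustrative, not from the slides):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void add( const float *a, const float *b, float *c, int n ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) c[i] = a[i] + b[i];                   // one element per thread
}

int main( void ) {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 2.0f * i; }
    float *d_a, *d_b, *d_c;
    cudaMalloc( (void**)&d_a, bytes );
    cudaMalloc( (void**)&d_b, bytes );
    cudaMalloc( (void**)&d_c, bytes );
    // 1) copy data from host to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice );
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice );
    // 2) execute the kernel on the device
    add<<<(N + 255) / 256, 256>>>( d_a, d_b, d_c, N );
    // 3) copy back the results
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}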

Grids, blocks and threads. The computing resources are logically (and physically) grouped in a flexible parallel model of computation: 1D, 2D or 3D grids, of 1D, 2D or 3D blocks, of 1D, 2D or 3D threads. Threads can communicate and synchronize only within a block; threads in different blocks do not interact, while threads in the same block execute the same instruction at the same time. The “shape” of the system is decided at kernel launch time, as in the sketch below. [E.Ozcan talk]
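For example, a 2D grid of 2D blocks covering a W×H image could be launched like this (a sketch; the kernel and the device buffer d_img are hypothetical):

__global__ void scale( float *img, int W, int H ) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // global pixel coordinates
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H) img[y * W + x] *= 2.0f;
}

dim3 block(16, 16);                                  // 256 threads per block
dim3 grid( (W + block.x - 1) / block.x,              // enough blocks to cover
           (H + block.y - 1) / block.y );            // the whole image
scale<<<grid, block>>>( d_img, W, H );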

Mapping on the hardware (diagram: GP100).

The memory hierarchy is fundamental in GPU programming: most of the memory management and data locality is left to the user (within a unified address space). Global memory: on board, relatively slow, lifetime of the application, accessible from both host and device. Shared memory/registers: on chip, very fast, lifetime of the block/thread, accessible from kernels only, as in the sketch below. [P. Durante’s Talk]
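A minimal sketch of the two levels (a three-point smoothing, assuming 256-thread blocks; the kernel and its names are illustrative):

__global__ void smooth( const float *in, float *out, int n ) {
    __shared__ float tile[256 + 2];                  // on chip, lifetime of the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x; // index into global memory (on board)
    int lid = threadIdx.x + 1;
    if (gid < n) tile[lid] = in[gid];                // stage data from slow global memory
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;    // halo cells at the block edges
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                                 // make the tile visible to all threads
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}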

Streams. The main purpose of all GPU computing is to hide latency. In the case of multiple data transfers from host to device, asynchronous data copies and kernel execution can be overlapped to avoid dead time, as in the sketch below.
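A hedged sketch of the idea with two streams (buffer names, chunk size and the process kernel are assumptions; the host buffers must be page-locked, e.g. allocated with cudaHostAlloc, for the copies to be truly asynchronous):

cudaStream_t s[2];
for (int i = 0; i < 2; i++) cudaStreamCreate( &s[i] );
for (int c = 0; c < nChunks; c++) {
    int k = c % 2;                                   // alternate between the two streams
    cudaMemcpyAsync( d_in[k], h_in + c * CHUNK, CHUNK * sizeof(float),
                     cudaMemcpyHostToDevice, s[k] );
    process<<<blocks, threads, 0, s[k]>>>( d_in[k], d_out[k], CHUNK );
    cudaMemcpyAsync( h_out + c * CHUNK, d_out[k], CHUNK * sizeof(float),
                     cudaMemcpyDeviceToHost, s[k] );
}
cudaDeviceSynchronize();                             // wait for both streams to drain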

A simple exercise: dot product. Two parts: compute each element-wise product, then sum all the elements. On a CPU each part is O(N); on a GPU the first part is O(1) and the parallel reduction is O(log N).

A simple example: dot product (kernel)

#define imin(a,b) (a<b?a:b)

const int N = 33 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );

__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;   // thread ID / element calculation
    int cacheIndex = threadIdx.x;

    float temp = 0;
    while (tid < N) {                                  // grid-stride loop over the elements
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;

    // synchronize threads in this block (synchronization wall)
    __syncthreads();

    // reduction: threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x/2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();                               // wait before the next reduction step
        i /= 2;
    }

    if (cacheIndex == 0)                               // one partial sum per block
        c[blockIdx.x] = cache[0];
} // end kernel

A simple example: dot product (host side)

int main( void ) {
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_c;

    // allocate memory on the CPU side
    a = (float*)malloc( N*sizeof(float) );
    b = (float*)malloc( N*sizeof(float) );
    partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, N*sizeof(float) );
    cudaMalloc( (void**)&dev_b, N*sizeof(float) );
    cudaMalloc( (void**)&dev_c, blocksPerGrid*sizeof(float) );

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N*sizeof(float), cudaMemcpyHostToDevice );

A simple example: dot product (kernel call and results)

    // launch the kernel
    dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b, dev_c );

    // copy the array of partial sums back from the GPU to the CPU
    cudaMemcpy( partial_c, dev_c, blocksPerGrid*sizeof(float),
                cudaMemcpyDeviceToHost );

    // finish up on the CPU side: sum the per-block results
    c = 0;
    for (int i=0; i<blocksPerGrid; i++)
        c += partial_c[i];

    // free memory on the GPU side
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );

    // free memory on the CPU side
    free( a );
    free( b );
    free( partial_c );
} // end main

Tips & tricks. Avoid branches as much as possible. Organize data in memory so that accesses coalesce. Avoid frequent accesses to global memory. Use atomics. Fill the GPU with a lot of work, to hide latency. Use profiling tools to look for bottlenecks (Visual Profiler, Nsight, etc.). An example of the atomics hint is sketched below.
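As a sketch, the dot product of the previous slides can be finished on the GPU with one atomicAdd per block, instead of copying the partial sums back to the CPU (result must be zeroed before launch; float atomicAdd requires compute capability ≥ 2.0):

__global__ void dot_atomic( const float *a, const float *b, float *result, int n ) {
    __shared__ float cache[256];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        temp += a[i] * b[i];
    cache[threadIdx.x] = temp;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s /= 2) {    // per-block tree reduction
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd( result, cache[0] );  // one atomic per block
}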

Other ways to program GPUs. CUDA is the “best” way to program NVIDIA GPUs at “low level”. If your code is mostly CPU code, or if you only need to accelerate dedicated functions, you could consider using directives (OpenMP, OpenACC, …) or libraries (Thrust, ArrayFire, …), as sketched below. OpenCL is a framework equivalent to CUDA for programming multiple platforms (GPU, CPU, DSP, FPGA, …); NVIDIA GPUs support OpenCL.
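For instance, with the Thrust library (bundled with the CUDA toolkit) the whole dot product of the previous slides shrinks to one call; a sketch:

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

float dot( const thrust::device_vector<float>& a,
           const thrust::device_vector<float>& b ) {
    // device allocation, kernel and reduction are generated by the library
    return thrust::inner_product( a.begin(), a.end(), b.begin(), 0.0f );
}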

Trigger and GPUs

Next generation trigger. Next generation experiments will look for tiny effects, so the trigger systems become more and more important: higher readout bandwidth, new links to bring data faster to the processing nodes, accurate online selection, high quality selection closer and closer to the detector readout, flexibility, scalability, upgradability. More software, less hardware.

Where to use GPUs in the trigger? In the High Level Trigger it is the “natural” place: if your problem can be parallelized (over events or within the algorithm) you can gain a factor in speed-up → a smaller number of PCs in the online farm (a few examples in the backup slides). In the Low Level Trigger: bring the power and flexibility of processors close to the data source → more physics. [F.Pastore’s Talk]

Different solutions. Brute force: PCs. Bring all data to a huge PC farm, using fast (and eventually smart) routers. Pro: easy to program, flexible. Cons: very expensive, most of the resources are spent just to process junk. Rock solid: custom hardware. Build your own board with dedicated processors and links. Pro: power, reliability. Cons: several years of R&D (sometimes to reinvent the wheel), limited flexibility. Elegant: FPGA. Use programmable logic as a flexible way to apply your trigger conditions. Pro: flexibility and low, deterministic latency. Cons: not so easy (up to now) to program; algorithm complexity limited by FPGA clock and logic. Off-the-shelf: GPU. Exploit hardware built for other purposes and continuously developed for other reasons. Pro: cheap, flexible, scalable, PC based. Cons: latency. [Sakulin’s & Marin’s talks] [K.Kordas talk]

GPUs in the low level trigger. Latency: is the GPU latency per event small enough to cope with the tiny latency of a low level trigger system? Is the latency stable enough for use in synchronous trigger systems? Computing power: is the GPU fast enough to take trigger decisions at event rates of tens of MHz?

Low level trigger: NA62 test bench. Fixed target experiment on the SPS (slow extraction), looking for the ultra-rare kaon decay K+ → π+ ν ν̄. The RICH: 17 m long, 3 m in diameter, filled with Ne at 1 atm; it reconstructs Cherenkov rings to distinguish between pions and muons from 15 to 35 GeV. Two spots of ~1000 PMTs each; time resolution: 70 ps; mis-ID: 5×10⁻³; 10 MHz of events with about 20 hits per particle.

TEL62: 512 HPTDC channels, 5 FPGAs, DDR2 memories for the readout buffer. The readout data are also used to build trigger primitives; data and primitives are transmitted over Ethernet (UDP).

The NA62 “standard” TDAQ system (diagram: RICH, MUV, CEDAR, LKR, STRAWS and LAV send trigger primitives to the L0TP; data flow through a GigaEth switch to the L1/L2 PCs, the event builder and CDR at O(kHz)). L0: hardware synchronous level, 10 MHz → 1 MHz, max latency 1 ms. L1: software level, “single detector”, 1 MHz → 100 kHz. L2: software level, “complete information”, 100 kHz → a few kHz.

Latency: the main problem of GPU computing. The total latency is dominated by the double copy in the host RAM (NIC → host RAM → VRAM, through the PCI Express chipset and CPU). Remedies: decrease the data transfer time with DMA (Direct Memory Access) and custom management of the NIC buffers (custom driver); “hide” some components of the latency by optimizing the multi-event computing.

NaNet-10: an ALTERA Stratix V dev board (TERASIC DE5-Net board), PCIe x8 Gen3 (8 GB/s), 4 SFP+ ports (link speed up to 10 Gb/s), GPUDirect/RDMA, UDP offload support, 4×10 Gb/s links, stream processing on the FPGA (merging, decompression, …). Work is ongoing on 40 GbE (100 GbE foreseen).


NA62 GPU trigger system: 2024 TDC channels on 4 TEL62 boards; 8×1 Gb/s links for data readout, 4×1 Gb/s for the standard trigger primitives and 4×1 Gb/s for the GPU trigger (via NaNet-10). Readout event: 1.5 kb (1.5 Gb/s); GPU-reduced event: 300 b (3 Gb/s). GPU: NVIDIA K20, 2688 cores, 3.9 Teraflops, 6 GB VRAM (250 GB/s memory bandwidth), PCIe Gen3. Events rate: 10 MHz; L0 trigger rate: 1 MHz; max latency: 1 ms; total buffering (per board): 8 GB; max output bandwidth (per board): 4 Gb/s.

The ring fitting problem. Trackless: no information from the tracker; difficult to merge information from many detectors at L0. Fast: non-iterative procedure; event rates at the level of tens of MHz. Low latency: online (synchronous) trigger. Accurate: offline resolution required. Multi-ring algorithms on the market: with seeds (likelihood, constrained Hough, …); trackless (fiTQun, APFit, possibilistic clustering, Metropolis-Hastings, Hough transform, …).

Almagest. A new algorithm (Almagest) based on Ptolemy’s theorem: “a quadrilateral is cyclic (its vertices lie on a circle) if and only if AD·BC + AB·DC = AC·BD”. The procedure is designed for a parallel implementation.

Almagest: i) select a triplet (3 starting points); ii) loop on the remaining points: if the next point does not satisfy Ptolemy’s condition, reject it; iii) if the point satisfies Ptolemy’s condition, consider it for the fit; iv) …and so on…; v) perform a single ring fit; vi) repeat, excluding the points already used. A minimal sketch of the test in step ii) follows.
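A hypothetical helper for the Ptolemy test (not the actual NA62 code; it assumes the four points are taken in cyclic order, so that AC and BD are the diagonals, and uses a tolerance since real hits never lie exactly on a circle):

struct Hit { float x, y; };

__host__ __device__ float dist( Hit p, Hit q ) {
    return sqrtf( (p.x - q.x) * (p.x - q.x) + (p.y - q.y) * (p.y - q.y) );
}

// D lies (approximately) on the circle through A, B, C
// if and only if AD*BC + AB*DC = AC*BD (Ptolemy)
__host__ __device__ bool onSameCircle( Hit A, Hit B, Hit C, Hit D, float tol ) {
    float lhs = dist(A, D) * dist(B, C) + dist(A, B) * dist(D, C);
    float rhs = dist(A, C) * dist(B, D);
    return fabsf(lhs - rhs) < tol;
}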

Histogram approach: the XY plane is divided into a grid, and a histogram of the distances to the hits is created for each point in the grid; a sketch follows.
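A hedged sketch of the kernel for one grid point (binning, range and names are illustrative, not the production code); a peak in the histogram then gives the ring radius:

#define NBINS 128
#define RMAX  300.0f                                 // assumed histogram range

__global__ void distHisto( const float *hx, const float *hy, int nHits,
                           float gx, float gy, unsigned int *histo ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < nHits) {
        float dx = hx[i] - gx, dy = hy[i] - gy;
        float r  = sqrtf( dx * dx + dy * dy );       // distance grid point -> hit
        int bin  = (int)( r / RMAX * NBINS );
        if (bin >= 0 && bin < NBINS) atomicAdd( &histo[bin], 1u );
    }
}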

Processing flow: data from the four TEL62 boards are received by NaNet-10 (~10 µs); pre-processing on the FPGA (<10 µs): protocol offload, decompression, merging; copy through PCIe into VRAM (DMA) and event signal to the GPU; GPU computing (kernel); results available, then send and re-synchronization (for realtime), …

Results in the 2016 NA62 run. Testbed: Supermicro X9DRG-QF, Intel C602 Patsburg, Intel Xeon E5-2602 2.0 GHz, 32 GB DDR3, NVIDIA K20c. At ~25% of the target beam intensity (9×10¹¹ pps), with a gathering time of 350 µs: processing time per event of 1 µs (to be improved), processing latency below 200 µs, compatible with the NA62 requirements.

Conclusions. In 2035 the trigger will look different from today: more computing power is unavoidable. Heterogeneous computing with GPUs could be a possible solution to help design the new architectures. The migration towards COTS is a trend in our field, and triggering with processors built for other purposes can give several advantages. In addition, programming GPUs is quite fun… for several reasons… THANK YOU!!!

R. Ammendola(a), A. Biagioni(b), P. Cretaro(b), S. Di Lorenzo(c), O. Frezza(b), G. Lamanna(d), F. Lo Cicero(b), A. Lonardo(b), M. Martinelli(b), P. S. Paolucci(b), E. Pastorelli(b), R. Piandani(f), L. Pontisso(d), D. Rossetti(e), F. Simula(b), M. Sozzi(c), P. Valente(b), P. Vicini(b). (a) INFN Sezione di Roma Tor Vergata, (b) INFN Sezione di Roma, (c) INFN Sezione di Pisa, (d) INFN LNF, (e) nVIDIA Corporation, USA

SPARES

Computing vs LUT (diagram: complexity axis). Simple functions (sin, cos, log, …; 2×2=4) can be handled by LUTs, while complex tasks (e.g. a Higgs analysis) need processors. Where is the limit between the two? It depends… In any case, GPUs aim to shrink this space.

Accelerators: co-processors for intensive computing. Nowadays co-processors are connected through a standard bus.

Computing power

GPU in ATLAS HLT

GPU in ATLAS HLT: results (plots for tracking, muon and calorimeter): good quality with respect to the reference!

ALICE: HLT TPC online tracking. 2 kHz input at the HLT, 5×10⁷ B/event, 25 GB/s, 20000 tracks/event; cellular automaton + Kalman filter on a GTX 580.

Mu3e: possibly a “trigger-less” approach. High rate: 2×10⁹ tracks/s, >100 GB/s data rate. Data taking will start after 2016.

CBM & STAR: 10⁷ Au+Au collisions/s, ~1000 tracks/event, trigger-less. Given the continuous structure of the beam, ~10000 tracks/frame; cellular automaton + Kalman filter; grid of 100000 cores.

PANDA: 10⁷ events/s; full reconstruction for online selection (tracking, EMC, PID, …): assuming 1-10 ms per event → 10000-100000 CPU cores. First exercise: online tracking. A comparison between the same code on FPGA and on GPU shows the GPUs 30% faster for this application (a factor 200 with respect to the CPU). Data flow: 1 TB/s in, 1 GB/s out.