1
GPU-based quasi-real-time track recognition in imaging devices: from raw data to particle tracks
Cristiano Bozza, Università di Salerno/INFN
Pisa, 10/9/2014
On behalf of C. B. (1), Umut Kose (2), Simona Maria Stellacci (1), Chiara De Sio (1)
(1) University of Salerno; (2) CERN
2
Nuclear emulsions as visualizing detectors
Nuclear emulsions as data source
– Used recently in CHORUS, DONUT, PEANUT, OPERA
– Application to muon radiography of volcanoes and buildings (e.g. nuclear reactors, nuclear waste depots)
Highest spatial resolution available
– 0.1 µm, 1 mrad or better
No time trigger
Ideal test bench for tracking algorithms for many detectors
[Figure: emulsion film cross-section; labels: 205 µm plastic, 44 µm emulsion, microtracks]
Pisa, 10/9/2014, Cristiano Bozza - Università di Salerno / INFN
3
Nuclear emulsions as visualizing detectors
Automatic microscopes
– SySal, ESS, QSS (Europe)
– TS, NTS, UTS, S-UTS, HTS (Japan)
Optical tomography: take data from a large volume by “scanning” it in views
– The X and Z axes move during data taking
– Tracks normally span two views
[Figure: ESS/QSS scanning scheme with x, y, z axes and consecutive views #1 to #5]
4
Nuclear emulsions as visualizing detectors
Typical emulsion image (from Quick Scanning System)
– FOV size: 770×550 µm², 31 images/view (tomography)
– Grain diameter: 0.5 µm
– “Background” grains (“fog”, radioactivity): 25000/image
– m.i.p. track grains: 0~10/image
Task: find all 3D tracks!
– Lighting is not uniform
– No time trigger
– No way to identify “good” grains before tracking
[Figure: emulsion image with scale labels 87 µm and 124 µm, showing 10 years’ background pile-up]
5
Nuclear emulsions as visualizing detectors
This study was triggered by R&D on QSS
– Current top speed: 41~90 cm²/h/side
– Outlook with improved stage control: 150 cm²/h/side
– Faster cameras: 250~300 cm²/h/side (image transmission speed sets the limit)
Applications
– High energy physics with topological study of events on µm~cm scale: neutrino physics, charm physics, tau physics, exotic particles with characteristic decay signatures
– Muon radiography: Stromboli, Unzen, Teide, La Palma fault
6
Data flow from automatic microscopes
Relevant figures
– QSS scanning speed: 41 or 90 cm²/h/side
– Clusters of dark pixels/view (31 images): 5×10⁵
– Grains (dark clusters with size constraints): 1.5×10⁵
– Raw data (grains) / film (120 cm²): 300 GB
– Image data rate: 2 or 4 GB/s (2×, 4× Camera Link protocol)
– Raw data rate/microscope: 50 or 110 GB/h (110 or 250 Mbps)
– Processed data (tracks as sequences of grains): amount depends on angular acceptance and film quality, but about 2 GB/film is realistic
– Microscopes/laboratory: 2~10
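The raw data rates quoted above can be cross-checked with a few lines of arithmetic. The sketch below is not from the slides: the function names are hypothetical, and it assumes decimal units (1 GB = 10⁹ bytes) and that each film is scanned on two sides, which is how the per-side scanning speed is interpreted here.

```python
def gb_per_hour_to_mbps(gb_per_hour):
    """Convert GB/h to Mbit/s, assuming decimal units (1 GB = 1e9 bytes)."""
    return gb_per_hour * 1e9 * 8 / 3600 / 1e6


def raw_rate_gb_per_hour(speed_cm2_per_h_side, gb_per_film=300.0,
                         film_area_cm2=120.0, sides_per_film=2):
    """Raw data rate for a given scanning speed.

    Assumption (not stated on the slide): the 300 GB/film figure covers
    both sides of a 120 cm2 film, i.e. 240 cm2 of scanned surface.
    """
    gb_per_cm2_side = gb_per_film / (film_area_cm2 * sides_per_film)
    return speed_cm2_per_h_side * gb_per_cm2_side


for speed in (41, 90):
    rate = raw_rate_gb_per_hour(speed)
    print(f"{speed} cm2/h/side -> {rate:.1f} GB/h -> "
          f"{gb_per_hour_to_mbps(rate):.1f} Mbps")
```

Under these assumptions the two QSS speeds give roughly 51 and 112 GB/h, i.e. about 114 and 250 Mbps, consistent with the rounded figures on the slide.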
7
Data flow from automatic microscopes
Living in a GPU-less world
All numbers for 1 microscope, 20 cm²/h/side (ESS), “standard” angular acceptance
– 2D image preprocessing by FPGA device: Matrox Odyssey
– Dark cluster detection by host CPU: 4 cores
– 3D tracking by networked servers: 50 cores
– Processing hardware cost: ~40 k€
GPU-powered data acquisition
All numbers for 1 microscope, 41 cm²/h/side (QSS), “standard” angular acceptance
– 2D image processing by GPU on host PC: NVidia GTX 590/690
– 3D tracking by GPU-powered servers: 6×GTX 690 (18432 cores)
– Temporary staging area (RAMDisk) for raw data: 32 GB
– Processing hardware cost: 7 k€ (GTX 690), including the cost of the host workstation
8
Data flow from automatic microscopes
Work organisation with GPU
[Diagram labels]
– GTX 590/690 hosted in microscope workstation
– Temporary storage server
– Ensures constant flow
– Manages job allocation
– Dynamic reconfiguration
– Tracking servers host 1 or 2 GTX 690 each
– Data protocol: networked file system
– Control protocol: HTTP + SAWI (Server Application with Web Interface), which integrates web interface and interprocess communication
9
From images to microtracks
Step #1: from images to dark clusters
– GPU on data acquisition workstation
– Each image is treated separately
– Need feedback to drive Z axis (check emulsion entry/exit surfaces)
Step #2: from sets of dark clusters to grains
– GPUs on tracking servers
– Correct optical aberrations
– Correct vibrations and motion effects
Step #3: from grains to microtracks
– GPUs on tracking servers
– Find microtracks = sequences of aligned grains
– Algorithms could be ported to other detectors
10
Step #1: From images to dark clusters
Image upload to GPU
– The images in the same view (31 for 44 µm) are uploaded together (~124 MB bunch)
Equalization of grey-level histogram
– 5 kernels
Convolution with a 5×5 FIR filter and threshold
– 1 kernel (1 thread per output pixel)
Find horizontal “segments” of dark pixels
– 2 kernels (1 thread per horizontal line)
Assembling dark clusters from segments
– 5 kernels (1 thread per horiz. line, recursive)
Output to host memory
[Figure: grey-level (0-255) histogram and the 5×5 filter coefficient matrix; coefficients as extracted: 11111 1-21 1-2-4-21 1-21 11111]
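The last two stages of this step (horizontal segments, then clusters built from segments) can be illustrated on the CPU. This is a minimal sequential sketch, not the GPU kernels described above: it skips equalization and the FIR filter, treats any pixel below a hypothetical `threshold` as dark, and merges segments that overlap on adjacent rows with a union-find structure instead of the recursive per-line scheme on the slide. All function names are illustrative.

```python
def find_segments(row, threshold):
    """Return (start, end) pairs, end exclusive, for runs of dark pixels in one row."""
    segments, start = [], None
    for x, value in enumerate(row):
        if value < threshold:
            if start is None:
                start = x
        elif start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, len(row)))
    return segments


def assemble_clusters(image, threshold):
    """Merge overlapping segments on adjacent rows into clusters.

    Returns a sorted list of (area, centre_x, centre_y) tuples.
    """
    segs = []  # (row index, start, end), ordered by row
    for y, row in enumerate(image):
        segs.extend((y, s, e) for s, e in find_segments(row, threshold))

    parent = list(range(len(segs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, (yi, si, ei) in enumerate(segs):
        for j in range(i + 1, len(segs)):
            yj, sj, ej = segs[j]
            if yj > yi + 1:
                break  # later segments are even further down
            if yj == yi + 1 and si < ej and sj < ei:
                parent[find(i)] = find(j)

    sums = {}
    for i, (y, s, e) in enumerate(segs):
        acc = sums.setdefault(find(i), [0, 0.0, 0.0])
        n = e - s
        acc[0] += n
        acc[1] += n * (s + e - 1) / 2  # sum of x over the segment
        acc[2] += n * y
    return sorted((a, sx / a, sy / a) for a, sx, sy in sums.values())


image = [
    [200, 50, 50, 200, 200],
    [200, 200, 50, 200, 40],
    [200, 200, 200, 200, 40],
]
print(assemble_clusters(image, threshold=100))
```

In this toy image the three dark pixels at the top left form one cluster and the two at the right edge form another. One thread per row, as on the slide, maps naturally onto `find_segments`; the merging stage is where a GPU version needs more care.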
11
Step #1: From images to dark clusters
Image upload to GPU
Equalization of grey-level histogram
Convolution with a 5×5 FIR filter and threshold
Output to host memory
– Comparison with FPGA devices: Matrox Odyssey (v1 and v2)
Full processing: 2.5 ms/MB (GTX 590)
– Includes segments + clusters
– ~10 ms for 4 MPixel image
[Figure: comparison with FPGA devices, 2011 prices]
12
Step #2: From dark clusters to grains
Dark cluster data upload to GPU
– The dark clusters in the same view are uploaded together (raw data file)
Correction of optical aberrations
– 5 kernels
Correction of “in-view” alignment
– 23 kernels
– The X and Z axes move during readout
– The mechanics is not perfectly rigid
– Vibrations in the XY plane can occur
– Pattern matching of clusters seen in consecutive images yields highest precision
Correction of “cross-view” alignment
– 26 kernels
– Overall misalignment due to vibrations is corrected by 3D pattern matching of clusters in overlap region between views
Clusters can be merged to form grains
– Useful in some operational conditions
– Z is obtained by weighted average
Output to host memory is optional
– Data immediately reused for tracking
– Dump to host memory used only for debugging
[Figures: optical aberrations (XY curvature, Z curvature, Z axis slant in X and Y, XY trapezium, magnification vs. Z); matching of image n and image n+1; matching of view n and view n+1; axes x, y and x, z]
13
Step #2: From dark clusters to grains
Results of pattern matching
[Figures: alignment residual plots. Values as extracted: XY alignment: 0.12 µm; XY alignment: 0.15 µm; Z alignment: 2.6 µm. Panel titles: In-view image-to-image alignment; Cross-view alignment]
14
Step #3: From grains to microtracks
Track recognition: search for aligned grains
– Images of grains w.r.t. straight line fit: σ = 50 nm
– All grains in a track should lie within a cylinder defined by two grains (track “seed”)
Combinatorial complexity
– 6 to 30 grains per track
– N ≈ 5×10⁵/view
– N² possible tracks
– N³ possible grains in tracks
Reducing combinations
– Check only neighbouring grains
– Check only combinations within a defined angular tolerance
– “Constructively” enforce the constraints: browsing combinations and discarding them is already a waste of time!
Cylinder geometry available in different “flavours”:
– XY distance from axis
– XY + weighted Z
– XYZ distance from axis
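The cylinder test around a seed can be written down compactly. The following is a sketch of two of the “flavours” named above (XY distance and XYZ distance from the axis), with hypothetical function names and a `radius` parameter that the slides do not specify; the weighted-Z variant is omitted because its weighting is not given.

```python
import math


def xyz_distance_from_axis(p, a, b):
    """Perpendicular 3D distance of point p from the line through seed grains a and b."""
    d = [b[i] - a[i] for i in range(3)]
    w = [p[i] - a[i] for i in range(3)]
    cross = (
        w[1] * d[2] - w[2] * d[1],
        w[2] * d[0] - w[0] * d[2],
        w[0] * d[1] - w[1] * d[0],
    )
    return math.sqrt(sum(c * c for c in cross)) / math.sqrt(sum(c * c for c in d))


def xy_distance_from_axis(p, a, b):
    """Distance in the XY plane between p and the seed line evaluated at p's Z.

    Assumes the two seed grains lie at different Z.
    """
    t = (p[2] - a[2]) / (b[2] - a[2])
    x = a[0] + t * (b[0] - a[0])
    y = a[1] + t * (b[1] - a[1])
    return math.hypot(p[0] - x, p[1] - y)


def grains_in_cylinder(grains, a, b, radius, distance=xyz_distance_from_axis):
    """Select the grains lying within `radius` of the axis defined by the seed (a, b)."""
    return [g for g in grains if distance(g, a, b) <= radius]


seed_a, seed_b = (0.0, 0.0, 0.0), (0.0, 0.0, 10.0)
candidates = [(0.1, 0.0, 5.0), (1.0, 0.0, 5.0)]
print(grains_in_cylinder(candidates, seed_a, seed_b, radius=0.15))
```

Only the first candidate falls inside the cylinder here. Passing `distance=xy_distance_from_axis` switches flavour without touching the selection logic.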
15
Step #3: From grains to microtracks
Grain proximity in position/direction space
– In 2D: arrange grains in a grid of cells, check proximity only within same cell (or nearest neighbours)
– In 3D: scan the angular acceptance region in fixed steps. For each direction step, define a set of skewed prisms. Arrange grains in the skewed prisms and check for tracks only within each prism
– Size of prisms and angular step are chosen by fine-tuning
– Tracking time is proportional to angular acceptance (in this presentation, 1.32 sr, about 11% of 4π)
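One way to picture the skewed prisms: for a given direction step, shift each grain back along that direction to a reference plane and bin the result in an ordinary XY grid. The sketch below does exactly that; it is an interpretation of the slide, not the author's code, and the names `fill_prisms`, `slope_x`, `slope_y` and `cell_size` are hypothetical.

```python
import math
from collections import defaultdict


def fill_prisms(grains, slope_x, slope_y, cell_size):
    """Bin grains into skewed prisms for one direction step.

    Each grain (x, y, z) is projected along the direction (slope_x, slope_y)
    onto the plane z = 0; the prism index is the grid cell of the projection.
    """
    prisms = defaultdict(list)
    for x, y, z in grains:
        x0 = x - slope_x * z
        y0 = y - slope_y * z
        key = (math.floor(x0 / cell_size), math.floor(y0 / cell_size))
        prisms[key].append((x, y, z))
    return dict(prisms)


# Four grains along a track with slope 0.5 in X.
track = [(10.0 + 0.5 * z, 20.0, z) for z in (0.0, 10.0, 20.0, 30.0)]
print(len(fill_prisms(track, 0.5, 0.0, cell_size=4.0)))  # matching direction step
print(len(fill_prisms(track, 0.0, 0.0, cell_size=4.0)))  # vertical direction step
```

With the matching direction step all four grains share a prism, while with the vertical step they scatter over several. This is why the search must be repeated for each direction step and why tracking time grows with angular acceptance.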
16
Step #3: From grains to microtracks
The tracking algorithm can be easily adapted to other detectors
– Stacked planes of 2D pixels: the planes resemble Z layers
– Volume detectors (e.g. Liquid Argon): grains are always treated as 3D entities
– 4π angular acceptance: in this “flavour”, no prisms are used to constrain the slope, but a limit on track length is set (deviation from straight fit due to multiple scattering, bremsstrahlung, etc.)
Long tracks are obtained by stitching “short” pieces in the track merging stage
17
Step #3: From grains to microtracks
Pitfalls of GPU-coding for this algorithm – #1 Filling prisms
– If each thread corresponds to one grain, the risk of “collisions” (i.e. threads accessing the same prism at the same time) is very high
– “Atomic” functions (CUDA compute capability 1.1 or higher) can be used to settle “race” conditions
– With too many collisions the code becomes “quasi-serial”
– “Striding” threads: each thread handles a sequential block of n grains, to increase the chance that they access different prisms
– Drawback of thread striding: memory access is poorly coalesced (but tracks are not known in advance, and the memory span is broad)
– The code is non-deterministic: the exact order of filling is not specified, while it is ensured that all prisms will contain the right set of grains
18
Step #3: From grains to microtracks
Pitfalls of GPU-coding for this algorithm – #2 Seed scanning
– One seed is formed by a pair of grains in the same prism
– If each thread corresponds to one prism, with an average fill of N grains the fluctuations will be O(N^(1/2))
– The fluctuations in seeds will be O(N^(3/2)); example: if the number of grains fluctuates from 6 to 12, the number of pairs fluctuates from 15 to 66!
– Only a few threads will be running while all others have completed
– Allocate one thread per seed
– A seed in a crowded prism will still take more time because of more grains to check, but the fluctuation is only O(N^(1/2))
– Further optimisation is possible, but not worth the effort
– Drawback: memory access is poorly coalesced (but tracks are not known in advance, and the memory span is broad)
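The pair counts in the example follow from n(n-1)/2. The sketch below reproduces them and also shows, in plain sequential Python, what flattening the work into one unit per seed looks like. The prism contents are made-up labels and the function names are hypothetical; no GPU scheduling is modelled.

```python
from itertools import combinations


def seeds_in_prism(n_grains):
    """Number of grain pairs (seeds) in a prism holding n_grains grains."""
    return n_grains * (n_grains - 1) // 2


def list_seeds(prisms):
    """Flatten all seeds into one work list of (prism_id, grain_a, grain_b).

    With one thread per entry, each thread handles exactly one seed,
    regardless of how crowded its prism is.
    """
    return [
        (prism_id, a, b)
        for prism_id, grains in prisms.items()
        for a, b in combinations(grains, 2)
    ]


print(seeds_in_prism(6), seeds_in_prism(12))

# Two hypothetical prisms with 6 and 12 grains (grains shown as labels).
prisms = {0: [f"g{i}" for i in range(6)], 1: [f"h{i}" for i in range(12)]}
work = list_seeds(prisms)
print(len(work))
```

With one thread per prism, the second prism carries more than four times the work of the first; in the flattened list every entry is one seed, so the imbalance left is only the number of grains each seed must check.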
19
Step #3: From grains to microtracks
Pitfalls of GPU-coding for this algorithm – #3 Track merging
– In the ideal case, a track with n grains has n(n-1)/2 seeds
– The same track is reconstructed several times
– Track “clones” must be merged
– Tracks are checked in pairs comparing position and direction
– The track with fewer grains is suppressed (or the “second” in case of a tie)
– Reduction of combinations by proximity (tracks are stored in a grid of XY cells)
– The code is non-deterministic: the order of tracks matters in producing the result
– The “quality” of the set is always the same, but small differences can arise (0.1%)
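The suppression rule can be sketched sequentially. This is an illustration under assumptions, not the presented implementation: the `Track` fields, the tolerances `pos_tol` and `slope_tol`, and the all-pairs loop are hypothetical, and the XY grid used on the slides to limit comparisons is left out for brevity. Being sequential, this version is deterministic for a fixed input order, unlike the parallel code.

```python
from typing import NamedTuple


class Track(NamedTuple):
    x: float
    y: float
    slope_x: float
    slope_y: float
    n_grains: int


def is_clone(a, b, pos_tol, slope_tol):
    """Two tracks are clones if both position and direction agree within tolerance."""
    return (
        abs(a.x - b.x) <= pos_tol
        and abs(a.y - b.y) <= pos_tol
        and abs(a.slope_x - b.slope_x) <= slope_tol
        and abs(a.slope_y - b.slope_y) <= slope_tol
    )


def merge_clones(tracks, pos_tol, slope_tol):
    """Keep one track per clone group: fewer grains loses, a tie suppresses the second."""
    alive = [True] * len(tracks)
    for i in range(len(tracks)):
        for j in range(i + 1, len(tracks)):
            if not alive[i]:
                break  # track i was suppressed, stop comparing it
            if not alive[j]:
                continue
            if is_clone(tracks[i], tracks[j], pos_tol, slope_tol):
                if tracks[j].n_grains > tracks[i].n_grains:
                    alive[i] = False
                else:
                    alive[j] = False
    return [t for t, keep in zip(tracks, alive) if keep]


tracks = [
    Track(0.0, 0.0, 0.1, 0.0, 8),
    Track(0.05, 0.0, 0.1, 0.0, 10),
    Track(100.0, 100.0, 0.0, 0.0, 7),
    Track(0.0, 0.05, 0.1, 0.0, 10),
]
print(merge_clones(tracks, pos_tol=0.2, slope_tol=0.01))
```

The first track loses to the second (fewer grains), and the fourth ties with the second and is dropped as the later one. Reordering the input changes which of two tied clones survives, which is the order dependence noted above.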
20
Performances: view correction/mapping
Time spent in image/view correction
[Figure: time (ms) vs. dark clusters for GTX640, GTX690 (1/2), GTX780Ti, Tesla C2050]
21
Performances: view correction/mapping
Most time is spent in tracking
– Fraction of total time (Steps 1+2+3) spent in cluster-grain processing (Steps 1+2)
22
Performances: tracking
Tracking time vs. grains/view
[Figure: time (ms) vs. grains for GTX640, GTX690 (1/2), GTX780Ti, Tesla C2050]
23
Performances: tracking
Tracking time vs. tracks/view
No visible nonlinear bottlenecks
[Figure: time (ms) vs. tracks for GTX640, GTX690 (1/2), GTX780Ti, Tesla C2050]
24
Performances: tracking
Tracking time vs. tracks/view
[Figures: Log10(Time) vs. Log10(grains/view). GTX640, exponent: 1.88. GTX780Ti, exponent: 1.71]
The dependency is better than N²: the weight of computation stages with high combinatorial complexity is relatively small
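An exponent like those quoted is the slope of a straight-line fit in log-log space. The sketch below shows that calculation with an ordinary least-squares fit; the data points are synthetic (generated from an exact power law with exponent 1.8), not measurements from the presentation, and the function name is hypothetical.

```python
import math


def power_law_exponent(sizes, times):
    """Least-squares slope of log10(time) vs. log10(size)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic example: time = c * N^1.8 for a few grain counts per view.
grains_per_view = [1e4, 3e4, 1e5, 3e5, 1e6]
times_ms = [2e-6 * n ** 1.8 for n in grains_per_view]
print(f"fitted exponent: {power_law_exponent(grains_per_view, times_ms):.2f}")
```

A fitted slope below 2 on real timings is what supports the statement that the pairwise (N²) stages do not dominate the total.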
25
Performances: tracking
Compute work vs. grains
– Computework := Time(ms) × Cores × Clock(MHz)
– More recent architectures seem less efficient
– Effect of branch divergence with more cores/multiprocessor?
[Figure: Log10(computework) vs. Log10(grains/view) for GTX640 (Fermi 2.1), GTX690 (1/2) (Kepler 3.0), GTX780Ti (Kepler 3.5), Tesla C2050 (Fermi 2.0)]
26
Performances: tracking
Identifying and understanding bottlenecks
– Data from GTX640
– Branch divergence is almost only end-wait (completed threads wait for others running): difficult to improve without additional complications in the code; may be worth the effort for Maxwell architecture
– Track merging jumps over “Track” data blocks that are anyway large
– Memory could be coalesced by reshuffling thread order (under study)
– Track fitting is negligible w.r.t. track recognition
27
Conclusions
– The solution shown uses GPUs to implement a complex algorithm with many logical branches and non-trivial memory access patterns
– The tracking portion (Step #3) is suitable for a wide range of types of straight tracks (magnetic field weak or absent)
– Tracking planes or volume detectors natively supported
– “Know your code”: GPUs are effective, but efficient solutions need careful optimisation
– The algorithm's performance scales well with data size
– There is room to improve performance with the latest generation of boards