GPU-based quasi-real-time track recognition in imaging devices: from raw data to particle tracks Cristiano Bozza Università di Salerno/INFN Pisa – 10/9/2014.

On behalf of Cristiano Bozza (1), Umut Kose (2), Simona Maria Stellacci (1), Chiara De Sio (1)
1: University of Salerno
2: CERN

Nuclear emulsions as visualizing detectors
Nuclear emulsions as data source
– Used recently in CHORUS, DONUT, PEANUT, OPERA
– Application to muon radiography of volcanoes and buildings (e.g. nuclear reactors, nuclear waste depots)
Highest spatial resolution available
– 0.1 µm, 1 mrad or better
No time trigger
Ideal test bench for tracking algorithms for many detectors
[Figure: film cross-section: plastic base (205 µm), emulsion layers (44 µm), microtracks]

Nuclear emulsions as visualizing detectors
Automatic microscopes
– SySal, ESS, QSS (Europe)
– TS, NTS, UTS, S-UTS, HTS (Japan)
Optical tomography: take data from a large volume by “scanning” it in views
– The X and Z axes move during data taking
– Tracks normally span two views
[Figure: ESS/QSS scanning scheme with x, y, z axes and views #1 to #5]

Nuclear emulsions as visualizing detectors
Typical emulsion image
– (from Quick Scanning System)
– FOV size: 770×550 µm², 31 images/view (tomography)
– Grain diameter: 0.5 µm
– “background” grains (“fog”, radioactivity): 25000/image
– m.i.p. track grains: 0~10/image
Task: find all 3D tracks!
– Lighting is not uniform
– No time trigger
– No way to identify “good” grains before tracking
[Figure: emulsion image detail with scale labels 87 µm and 124 µm; 10 years’ background pile-up]

Nuclear emulsions as visualizing detectors
This study was triggered by R&D on QSS
– Current top speed: 41~90 cm²/h/side
– Outlook with improved stage control: 150 cm²/h/side
– Faster cameras: 250~300 cm²/h/side (image transmission speed sets the limit)
Applications
– High energy physics with topological study of events on µm~cm scale
    Neutrino physics
    Charm physics
    Tau physics
    Exotic particles with characteristic decay signatures
– Muon radiography
    Stromboli
    Unzen
    Teide
    La Palma fault

Data flow from automatic microscopes
Relevant figures
– QSS scanning speed: 41 or 90 cm²/h/side
– Clusters of dark pixels/view (31 images): 5×10⁵
– Grains (dark clusters with size constraints): 1.5×10⁵
– Raw data (grains) / film (120 cm²): 300 GB
– Image data rate: 2 or 4 GB/s (2×, 4× Camera Link protocol)
– Raw data rate/microscope: 50 or 110 GB/h (110 or 250 Mbps)
– Processed data (tracks as sequences of grains): amount depends on angular acceptance and film quality, but about 2 GB/film is realistic
– Microscopes/laboratory: 2~10
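
The raw data figures above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not part of the original talk: it assumes 1 GB = 10⁹ bytes and that both sides of a 120 cm² film are scanned. The function names are made up for illustration.

```python
def gb_per_hour_to_mbps(gb_per_hour):
    """Convert a data rate in GB/h to Mbit/s (assuming 1 GB = 1e9 bytes)."""
    bits_per_hour = gb_per_hour * 1e9 * 8
    return bits_per_hour / 3600 / 1e6


def raw_gb_per_film(film_area_cm2, speed_cm2_per_h, rate_gb_per_h, sides=2):
    """Raw data volume for one film; scanning both sides is an assumption."""
    hours = sides * film_area_cm2 / speed_cm2_per_h
    return hours * rate_gb_per_h


slow_mbps = gb_per_hour_to_mbps(50)    # quoted as about 110 Mbps
fast_mbps = gb_per_hour_to_mbps(110)   # quoted as about 250 Mbps
film_gb_slow = raw_gb_per_film(120, 41, 50)    # quoted as about 300 GB
film_gb_fast = raw_gb_per_film(120, 90, 110)   # same film at the higher speed
```

Under these assumptions both scanning speeds give close to 300 GB per film, in line with the quoted figure.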

Data flow from automatic microscopes
Living in a GPU-less world
All numbers for 1 microscope, 20 cm²/h/side (ESS), “standard” angular acceptance
– 2D image preprocessing by FPGA device: Matrox Odyssey
– Dark cluster detection by host CPU: 4 cores
– 3D tracking by networked servers: 50 cores
– Processing hardware cost: ~40 k€
GPU-powered data acquisition
All numbers for 1 microscope, 41 cm²/h/side (QSS), “standard” angular acceptance
– 2D image processing by GPU on host PC: NVIDIA GTX 590/690
– 3D tracking by GPU-powered servers: 6×GTX 690 (18432 cores)
– Temporary staging area (RAMDisk) for raw data: 32 GB
– Processing hardware cost: 7 k€ (GTX 690)
    Includes cost of host workstation

Data flow from automatic microscopes
Work organisation with GPU
– GTX 590/690 hosted in microscope workstation
– Temporary storage server
– Ensures constant flow
– Manages job allocation
– Dynamic reconfiguration
– Tracking servers host 1 or 2 GTX 690 each
– Data protocol: networked file system
– Control protocol: HTTP + SAWI (Server Application with Web Interface)
    Integrates web interface and interprocess communication

From images to microtracks
Step #1: from images to dark clusters
– GPU on data acquisition workstation
– Each image is treated separately
– Need feedback to drive Z axis (check emulsion entry/exit surfaces)
Step #2: from sets of dark clusters to grains
– GPUs on tracking servers
– Correct optical aberrations
– Correct vibrations and motion effects
Step #3: from grains to microtracks
– GPUs on tracking servers
– Find microtracks = sequences of aligned grains
– Algorithms could be ported to other detectors

Step #1: From images to dark clusters
Image upload to GPU
– The images in the same view (31 for 44 µm) are uploaded together (~124 MB bunch)
Equalization of grey-level histogram
– 5 kernels
Convolution with a 5×5 FIR filter and threshold
– 1 kernel (1 thread per output pixel)
Find horizontal “segments” of dark pixels
– 2 kernels (1 thread per horizontal line)
Assembling dark clusters from segments
– 5 kernels (1 thread per horiz. line, recursive)
Output to host memory
[Figure: grey-level scale (0-255) and 5×5 filter kernel; coefficients as extracted: 11111 1-21 1-2-4-21 1-21 11111]
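
The segment and cluster stages can be illustrated with a serial CPU sketch. This is not the GPU code from the talk: it assumes the image has already been filtered and thresholded into 0/1 pixels, and it joins overlapping segments on adjacent rows with a union-find structure, which is one common way to do it. All names are hypothetical.

```python
def find_segments(binary_image):
    """Return (row, x_start, x_end) for each horizontal run of dark (1) pixels.

    x_end is inclusive. On a GPU each row could be handled by one thread.
    """
    segments = []
    for y, row in enumerate(binary_image):
        x, width = 0, len(row)
        while x < width:
            if row[x]:
                start = x
                while x < width and row[x]:
                    x += 1
                segments.append((y, start, x - 1))
            else:
                x += 1
    return segments


def assemble_clusters(segments):
    """Merge segments that overlap on adjacent rows; return (area, x, y) per cluster."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    by_row = {}
    for idx, (y, _, _) in enumerate(segments):
        by_row.setdefault(y, []).append(idx)

    for idx, (y, x0, x1) in enumerate(segments):
        for other in by_row.get(y - 1, []):
            _, ox0, ox1 = segments[other]
            if ox0 <= x1 and x0 <= ox1:  # the two runs share a column
                parent[find(idx)] = find(other)

    sums = {}
    for idx, (y, x0, x1) in enumerate(segments):
        root = find(idx)
        area, sx, sy = sums.get(root, (0, 0.0, 0.0))
        n = x1 - x0 + 1
        sums[root] = (area + n, sx + n * (x0 + x1) / 2.0, sy + n * y)
    return sorted((a, sx / a, sy / a) for a, sx, sy in sums.values())


image = [
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
clusters = assemble_clusters(find_segments(image))
```

Each cluster comes out as its pixel area and centre of gravity; a size cut on the area would then select grain candidates.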

Step #1: From images to dark clusters
Image upload to GPU
Equalization of grey-level histogram
Convolution with a 5×5 FIR filter and threshold
Output to host memory
– Comparison with FPGA devices: Matrox Odyssey (v1 and v2)
Full processing: 2.5 ms/MB (GTX 590)
– Includes segments + clusters
– ~10 ms for 4 MPixel image
[Figure: comparison chart, 2011 prices]
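
The quoted per-image time follows from the per-MB rate if each pixel takes one byte, which is an assumption suggested by the 0-255 grey scale. The short check below also extrapolates to a full 31-image view, assuming the rate applies uniformly; that per-view number is not stated in the talk.

```python
MS_PER_MB = 2.5          # full processing rate quoted for the GTX 590
IMAGES_PER_VIEW = 31     # tomographic images per view
VIEW_UPLOAD_MB = 124     # approximate size of one view's image bunch
BYTES_PER_PIXEL = 1      # assumption: 8-bit grey levels

mb_per_image = VIEW_UPLOAD_MB / IMAGES_PER_VIEW
mpixel_per_image = mb_per_image / BYTES_PER_PIXEL
ms_per_image = mb_per_image * MS_PER_MB
ms_per_view = VIEW_UPLOAD_MB * MS_PER_MB   # extrapolation, not a quoted figure
```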

Step #2: From dark clusters to grains
Dark cluster data upload to GPU
– The dark clusters in the same view are uploaded together (raw data file)
Correction of optical aberrations
– 5 kernels
Correction of “in-view” alignment
– 23 kernels
– The X and Z axes move during readout
    The mechanics is not perfectly rigid
    Vibrations in the XY plane can occur
    Pattern matching of clusters seen in consecutive images yields highest precision
Correction of “cross-view” alignment
– 26 kernels
– Overall misalignment due to vibrations is corrected by 3D pattern matching of clusters in overlap region between views
Clusters can be merged to form grains
– Useful in some operational conditions
– Z is obtained by weighted average
Output to host memory is optional
– Data immediately reused for tracking
– Dump to host memory used only for debugging
[Figure: optical aberrations: XY curvature, Z curvature, Z axis slant (X and Y), XY trapezium, magnification vs. Z]
[Figure: in-view alignment between image n and image n+1 (x, y); cross-view alignment between view n and view n+1 (x, z)]
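
The talk does not say how the pattern matching is done. One simple possibility, shown here purely as an illustration, is displacement voting: every pair of clusters from two consecutive images votes for its XY offset, and the most populated bin gives the shift. The function name, the bin size and the search window are all assumptions.

```python
from collections import Counter


def estimate_shift(clusters_a, clusters_b, max_shift=2.0, bin_size=0.25):
    """Estimate the XY shift of image B relative to image A by voting.

    clusters_a, clusters_b: lists of (x, y) cluster positions.
    Returns (dx, dy, votes) for the most populated displacement bin,
    or None if no pair falls inside the search window.
    """
    votes = Counter()
    for xa, ya in clusters_a:
        for xb, yb in clusters_b:
            dx, dy = xb - xa, yb - ya
            if abs(dx) <= max_shift and abs(dy) <= max_shift:
                votes[(round(dx / bin_size), round(dy / bin_size))] += 1
    if not votes:
        return None
    (ix, iy), count = votes.most_common(1)[0]
    return ix * bin_size, iy * bin_size, count


image_n = [(0.0, 0.0), (10.0, 3.0), (4.0, 8.0), (7.0, 1.0)]
# Same clusters seen in the next image, displaced by a known test offset.
image_n1 = [(x + 0.5, y - 0.25) for x, y in image_n]
shift = estimate_shift(image_n, image_n1)
```

The double loop is quadratic in the number of clusters; with hundreds of thousands of clusters per view a real implementation would restrict the pairs with a grid of cells.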

Step #2: From dark clusters to grains
Results of pattern matching
– In-view image-to-image alignment: XY alignment σ: 0.12 µm
– Cross-view alignment: XY alignment σ: 0.15 µm, Z alignment σ: 2.6 µm
[Figure: alignment residual distributions, horizontal axes in µm]

Step #3: From grains to microtracks
Track recognition: search for aligned grains
– Images of grains w.r.t. straight line fit: σ = 50 nm
– All grains in a track should lie within a cylinder defined by two grains (track “seed”)
Cylinder geometry available in different “flavours”:
– XY distance from axis
– XY+weighted Z
– XYZ distance from axis
Combinatorial complexity
– 6 to 30 grains per track
– N ≈ 5×10⁵/view
– N² possible tracks
– N³ possible grains in tracks
Reducing combinations
– Check only neighbouring grains
– Check only combinations within a defined angular tolerance
– “Constructively” enforce the constraints: browsing combinations and discarding them is already a waste of time!
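
The cylinder test can be sketched as follows for two of the flavours. This is an illustrative reading of the slide, not the author's code: "XYZ distance" is taken as the 3D point-to-line distance, and "XY distance" as the transverse distance from the axis evaluated at the grain's Z. The function names and the radius are made up.

```python
import math


def xyz_distance_from_axis(grain, seed_a, seed_b):
    """3D distance of a grain from the line through the two seed grains."""
    ux, uy, uz = (seed_b[i] - seed_a[i] for i in range(3))
    wx, wy, wz = (grain[i] - seed_a[i] for i in range(3))
    # |u x w| / |u| is the point-to-line distance.
    cx = uy * wz - uz * wy
    cy = uz * wx - ux * wz
    cz = ux * wy - uy * wx
    return math.sqrt(cx * cx + cy * cy + cz * cz) / math.sqrt(ux * ux + uy * uy + uz * uz)


def xy_distance_from_axis(grain, seed_a, seed_b):
    """Transverse (XY) distance from the axis taken at the grain's Z.

    Assumes the two seed grains have different Z.
    """
    t = (grain[2] - seed_a[2]) / (seed_b[2] - seed_a[2])
    ax = seed_a[0] + t * (seed_b[0] - seed_a[0])
    ay = seed_a[1] + t * (seed_b[1] - seed_a[1])
    return math.hypot(grain[0] - ax, grain[1] - ay)


def grains_in_cylinder(grains, seed_a, seed_b, radius, distance=xyz_distance_from_axis):
    """Indices of the grains lying within `radius` of the seed axis."""
    return [i for i, g in enumerate(grains) if distance(g, seed_a, seed_b) <= radius]


seed_a, seed_b = (0.0, 0.0, 0.0), (0.0, 0.0, 10.0)
grains = [(0.1, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, -0.2, 8.0)]
selected = grains_in_cylinder(grains, seed_a, seed_b, radius=0.3)
```

For inclined tracks the two flavours differ: the XY distance is larger than the 3D distance by a factor that grows with the slope.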

Step #3: From grains to microtracks
Grain proximity in position/direction space
– In 2D: arrange grains in a grid of cells, check proximity only within same cell (or nearest neighbours)
– In 3D: scan the angular acceptance region in fixed steps
    For each direction step, define a set of skewed prisms
    Arrange grains in the skewed prisms and check for tracks only within each prism
– Size of prisms and angular step are chosen by fine-tuning
– Tracking time is proportional to angular acceptance (in this presentation, 1.32 sr ≈ 11% of 4π)
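
One way to picture the skewed prisms: for a given direction step, project every grain along that direction onto a reference plane and bin the projected position in a regular XY grid. Grains of a track with that slope then share a cell. The sketch below follows this reading; it is an assumption about the construction, and the names and cell size are hypothetical.

```python
import math
from collections import defaultdict


def fill_prisms(grains, slope_x, slope_y, cell_size):
    """Group grain indices by skewed prism for one direction step.

    grains: list of (x, y, z). The prism axis has slopes dx/dz = slope_x and
    dy/dz = slope_y; each grain is projected along it onto the plane z = 0.
    """
    prisms = defaultdict(list)
    for index, (x, y, z) in enumerate(grains):
        px = x - slope_x * z
        py = y - slope_y * z
        key = (math.floor(px / cell_size), math.floor(py / cell_size))
        prisms[key].append(index)
    return prisms


# Four grains on a straight line with slope 0.5 in X, plus one unrelated grain.
track = [(10.2 + 0.5 * z, 3.1, z) for z in (0.0, 5.0, 10.0, 15.0)]
grains = track + [(40.0, 40.0, 7.0)]

matched = fill_prisms(grains, slope_x=0.5, slope_y=0.0, cell_size=1.0)
vertical = fill_prisms(grains, slope_x=0.0, slope_y=0.0, cell_size=1.0)
```

With the matching slope the four track grains fall in a single prism, while with vertical prisms they are spread over four cells. A grain near a cell border can still end up in the neighbouring prism, which is why cell size and angular step need tuning.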

Step #3: From grains to microtracks
The tracking algorithm can be easily adapted to other detectors
– Stacked planes of 2D pixels → resembles Z layers
– Volume detectors (e.g. Liquid Argon) → grains are always treated as 3D entities
– 4π angular acceptance → in this “flavour”, no prisms are used to constrain the slope, but a limit on track length is set (deviation from straight fit due to multiple scattering, bremsstrahlung, etc.)
    Long tracks are obtained by stitching “short” pieces in the track merging stage

Step #3: From grains to microtracks
Pitfalls of GPU-coding for this algorithm
– #1 Filling prisms
    If each thread corresponds to one grain, the risk of “collisions” (i.e. threads accessing the same prism at the same time) is very high
    “Atomic” functions (CUDA compute capability 1.1 or higher) can be used to settle “race” conditions
    With too many collisions the code becomes “quasi-serial”
    “Striding” threads: each thread handles a sequential block of n grains, to increase the chance that they access different prisms
    Drawback of thread striding: memory access is poorly coalesced (but tracks are not known in advance, and the memory span is broad)
    The code is non-deterministic: the exact order of filling is not specified, while it is ensured that all prisms will contain the right set of grains
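
The striding scheme can be emulated serially to show the index mapping and why the filling order is not fixed. The sketch assumes the prism of each grain is already known and that threads advance in lock-step; a list append stands in for the atomic operation a GPU kernel would use. Names are hypothetical.

```python
def fill_prisms_strided(prism_of_grain, block_len):
    """Emulate threads that each handle a sequential block of `block_len` grains.

    prism_of_grain[g] is the prism index of grain g. Threads are advanced in
    lock-step, so at every step thread t touches grain t * block_len + step.
    """
    n_grains = len(prism_of_grain)
    n_threads = (n_grains + block_len - 1) // block_len
    prisms = {}
    for step in range(block_len):
        for thread_id in range(n_threads):
            grain = thread_id * block_len + step
            if grain < n_grains:
                # On a GPU this append would be an atomic increment of the
                # prism's fill counter followed by a store.
                prisms.setdefault(prism_of_grain[grain], []).append(grain)
    return prisms


prism_of_grain = [0, 0, 1, 1, 2, 2, 0]
filled = fill_prisms_strided(prism_of_grain, block_len=2)
```

In this toy input, neighbouring grains share a prism, so threads that run at the same step mostly touch different prisms. The order inside each prism differs from a serial fill, but the set of grains per prism is the same, which is the property stated on the slide.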

Step #3: From grains to microtracks
Pitfalls of GPU-coding for this algorithm
– #2 Seed scanning
    One seed is formed by a pair of grains in the same prism
    If each thread corresponds to one prism, with an average fill of N grains the fluctuations will be O(N^(1/2))
    The fluctuations in seeds will be O(N^(3/2)); example: if the number of grains fluctuates from 6 to 12, the number of pairs fluctuates from 15 to 66!
    Only a few threads will be running while all others have completed
    Allocate one thread per seed
    A seed in a crowded prism will still take more time because of more grains to check, but the fluctuation is only O(N^(1/2))
    Further optimisation is possible, but not worth the effort
    Drawback: memory access is poorly coalesced (but tracks are not known in advance, and the memory span is broad)
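
The pair counts in the example follow from n(n-1)/2. The sketch below reproduces them and shows one way to flatten the seeds of all prisms into a single work list, so that each work item (one thread, in the GPU version) is one seed. The flattening scheme and names are assumptions for illustration.

```python
def seed_count(n_grains):
    """Number of grain pairs (seeds) in a prism holding n_grains grains."""
    return n_grains * (n_grains - 1) // 2


def enumerate_seeds(prisms):
    """Flatten all seeds into (prism_id, grain_a, grain_b) work items.

    prisms: list of lists of grain indices, one list per prism.
    """
    seeds = []
    for prism_id, grains in enumerate(prisms):
        for a in range(len(grains)):
            for b in range(a + 1, len(grains)):
                seeds.append((prism_id, grains[a], grains[b]))
    return seeds


pairs_low = seed_count(6)
pairs_high = seed_count(12)
work_items = enumerate_seeds([[0, 1, 2], [3, 4], [5]])
```

With one work item per seed, a crowded prism simply contributes more items instead of making one thread run much longer than the others.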

Step #3: From grains to microtracks
Pitfalls of GPU-coding for this algorithm
– #3 Track merging
    In the ideal case, a track with n grains has n(n-1)/2 seeds
    The same track is reconstructed several times
    Track “clones” must be merged
    Tracks are checked in pairs comparing position and direction
    The track with fewer grains is suppressed (or the “second” in case of a tie)
    Reduction of combinations by proximity (tracks are stored in a grid of XY cells)
    The code is non-deterministic: the order of tracks matters in producing the result
    The “quality” of the set is always the same, but small differences can arise (0.1%)
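
A serial sketch of the clone suppression rule is given below. It is not the author's implementation: tracks are represented as hypothetical (x, y, slope_x, slope_y, n_grains) tuples, the tolerances are invented, and the grid of XY cells used to limit the comparisons is left out for brevity.

```python
def is_clone(a, b, pos_tol, slope_tol):
    """Two tracks are clones if position and direction agree within tolerance."""
    return (abs(a[0] - b[0]) <= pos_tol and abs(a[1] - b[1]) <= pos_tol
            and abs(a[2] - b[2]) <= slope_tol and abs(a[3] - b[3]) <= slope_tol)


def merge_clones(tracks, pos_tol, slope_tol):
    """Keep one track per clone group.

    The track with fewer grains is suppressed; on a tie the second one goes.
    The result depends on the order of `tracks`, which is why a parallel
    version with unspecified ordering is not strictly deterministic.
    """
    alive = [True] * len(tracks)
    for i in range(len(tracks)):
        if not alive[i]:
            continue
        for j in range(i + 1, len(tracks)):
            if not alive[j] or not is_clone(tracks[i], tracks[j], pos_tol, slope_tol):
                continue
            if tracks[j][4] > tracks[i][4]:
                alive[i] = False   # the later track has more grains
                break
            alive[j] = False       # fewer grains, or a tie: drop the second
    return [t for t, keep in zip(tracks, alive) if keep]


tracks = [
    (100.0, 50.0, 0.100, -0.200, 8),
    (100.1, 50.05, 0.101, -0.199, 10),   # clone of the first, more grains
    (300.0, 80.0, 0.000, 0.300, 7),
    (300.0, 80.0, 0.000, 0.300, 7),      # exact tie with the previous one
]
merged = merge_clones(tracks, pos_tol=0.5, slope_tol=0.01)
```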

Performance: view correction/mapping
Time spent in image/view correction
[Figure: time (ms) vs. number of dark clusters for GTX640, GTX690 (1/2), GTX780Ti and Tesla C2050]

Performance: view correction/mapping
Most time is spent in tracking
– Fraction of total time (Steps 1+2+3) spent in cluster-grain processing (Steps 1+2)

Performance: tracking
Tracking time vs. grains/view
[Figure: time (ms) vs. grains for GTX640, GTX690 (1/2), GTX780Ti and Tesla C2050]

Performance: tracking
Tracking time vs. tracks/view
No visible nonlinear bottlenecks
[Figure: time (ms) vs. tracks for GTX640, GTX690 (1/2), GTX780Ti and Tesla C2050]

Performance: tracking
Tracking time vs. tracks/view
– GTX640: exponent 1.88
– GTX780Ti: exponent 1.71
The dependence is better than N²: the weight of computation stages with high combinatorial complexity is relatively small
[Figure: Log10(Time) vs. Log10(grains/view) for GTX640 and GTX780Ti, with power-law fits]
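
An exponent of this kind is the slope of a straight-line fit in log-log space. The sketch below shows the fit on synthetic points generated from a known power law, only to demonstrate the method; the data are not the measurements behind the plots, and the prefactor is arbitrary.

```python
import math


def power_law_exponent(xs, ys):
    """Least-squares slope of log10(y) vs. log10(x), i.e. k in y ~ x**k."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Synthetic example: times generated with exponent 1.88 and an arbitrary prefactor.
grains_per_view = [1e4, 2e4, 5e4, 1e5, 2e5]
times_ms = [3e-6 * n ** 1.88 for n in grains_per_view]
exponent = power_law_exponent(grains_per_view, times_ms)
```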

Performance: tracking
Compute work vs. grains
– Computework := Time(ms)×Cores×Clock(MHz)
– More recent architectures seem less efficient
    Effect of branch divergence with more cores/multiprocessor?
[Figure: Log10(computework) vs. Log10(grains/view) for GTX640 (Fermi 2.1), GTX690 (1/2) (Kepler 3.0), GTX780Ti (Kepler 3.5) and Tesla C2050 (Fermi 2.0)]

Performance: tracking
Identifying and understanding bottlenecks
– Data from GTX640
– Branch divergence is almost only end-wait (completed threads wait for others running)
    Difficult to improve without additional complications in the code
    May be worth the effort for Maxwell architecture
– Track merging jumps over “Track” data blocks that are anyway large
– Memory could be coalesced by reshuffling thread order (under study)
– Track fitting is negligible w.r.t. track recognition

Conclusions
– The solution shown uses GPUs to implement a complex algorithm with many logical branches and non-trivial memory access patterns
– The tracking portion (Step #3) is suitable for a wide range of types of straight tracks (magnetic field weak or absent)
– Tracking planes or volume detectors are natively supported
– “Know your code”: GPUs are effective, but efficient solutions need careful optimisation
– The algorithm's performance scales well with data size
– There is room to improve performance with the latest generation of boards
