Optimizing the trace transform
Using OpenMP and CUDA
Tim Besard
Trace transform needs to be real-time
MATLAB
– Slow
– Difficult to optimize
C++ base implementation
– Allows for optimizations
Optimizing the trace transform
– How to parallelize?
– OpenMP
– CUDA
– Performance
How to parallelize?
Coarse-grained parallelism
– Rotate the image from 0° to 359°
– Apply the T-functionals to each rotation, producing one sinogram row per functional
– The rows together form the sinogram
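A minimal sketch of this structure, for illustration only: the Image and TFunctional types and the rotate and trace_transform names are assumptions, not the actual C++ code. Every angle produces its sinogram rows independently of the others, which is what makes the outer loop a good target for coarse-grained parallelism.

    #include <algorithm>
    #include <functional>
    #include <vector>

    // Hypothetical types: a square row-major float image, and a T-functional
    // that collapses a rotated image into one sinogram row of 'size' samples.
    using Image = std::vector<float>;
    using TFunctional = std::function<std::vector<float>(const Image&, int)>;

    std::vector<std::vector<float>> trace_transform(
            const Image& input, int size,
            const std::vector<TFunctional>& tfunctionals,
            Image (*rotate)(const Image&, int, float))
    {
        // One sinogram per T-functional: 360 rows of 'size' samples each.
        std::vector<std::vector<float>> sinograms(
                tfunctionals.size(), std::vector<float>(360 * size));

        for (int angle = 0; angle < 360; ++angle) {     // independent iterations
            Image rotated = rotate(input, size, (float)angle);
            for (size_t t = 0; t < tfunctionals.size(); ++t) {
                std::vector<float> row = tfunctionals[t](rotated, size);
                std::copy(row.begin(), row.end(),
                          sinograms[t].begin() + angle * size);
            }
        }
        return sinograms;
    }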
How to parallelize?
Fine-grained parallelism
– Rotation
– Functionals: prefix sum
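The prefix sum is what exposes parallelism inside a functional: a running sum over a column can be computed by a parallel scan across its pixels. A sketch using Thrust's inclusive_scan; the use of Thrust here is an assumption for illustration, not necessarily how the actual code computes it.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // Running (prefix) sum of one column of the rotated image, computed on
    // the device. The scan exposes fine-grained parallelism across pixels.
    thrust::device_vector<float> prefix_sum(const thrust::device_vector<float>& column)
    {
        thrust::device_vector<float> result(column.size());
        thrust::inclusive_scan(column.begin(), column.end(), result.begin());
        return result;
    }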
OpenMP
Compiler directives
– #pragma omp parallel for
– #pragma omp critical
– #pragma omp barrier
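Applied to the coarse-grained loop from the earlier sketch, the directives could look roughly like this; the dynamic schedule and the shared progress counter are illustrative assumptions, not the project's actual placement of the pragmas.

    // Inside the trace_transform sketch above; compile with -fopenmp.
    // Distribute the 360 independent angles over the available CPU cores.
    int rows_done = 0;

    #pragma omp parallel for schedule(dynamic)
    for (int angle = 0; angle < 360; ++angle) {
        Image rotated = rotate(input, size, (float)angle);
        for (size_t t = 0; t < tfunctionals.size(); ++t) {
            std::vector<float> row = tfunctionals[t](rotated, size);
            // Each angle writes its own rows, so no synchronization is needed here.
            std::copy(row.begin(), row.end(),
                      sinograms[t].begin() + angle * size);
        }
        // Shared state, such as a progress counter, does need protection.
        #pragma omp critical
        ++rows_done;
    }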
OpenMP
Compiler directives address coarse-grained parallelism
– Unobtrusive
– Significant overhead
5× speed-up
– 8-core machine
– Unoptimized
CUDA
Parallel computing platform
Programming model
– Lightweight threads
– Massively parallel
Address fine-grained parallelism
– Pixel-centric approach
– Complete re-implementation
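A minimal, hypothetical illustration of the pixel-centric approach, not the project's actual kernel: one lightweight thread per output pixel rotates the image with nearest-neighbour sampling.

    // One thread per output pixel: rotate the input image by 'angle' radians
    // around its centre using nearest-neighbour sampling.
    __global__ void rotate_kernel(const float* input, float* output,
                                  int size, float angle)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= size || y >= size)
            return;

        // Map the output pixel back onto the input image.
        float c = cosf(angle), s = sinf(angle);
        float cx = size / 2.0f;
        float sx = (x - cx) * c - (y - cx) * s + cx;
        float sy = (x - cx) * s + (y - cx) * c + cx;

        int ix = __float2int_rn(sx);
        int iy = __float2int_rn(sy);

        output[y * size + x] = (ix >= 0 && ix < size && iy >= 0 && iy < size)
                                   ? input[iy * size + ix]
                                   : 0.0f;
    }

    // Launch sketch: one 16x16 block per tile of the output image.
    // dim3 block(16, 16);
    // dim3 grid((size + 15) / 16, (size + 15) / 16);
    // rotate_kernel<<<grid, block>>>(d_input, d_output, size, angle);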
CUDA
Low-level details matter a lot
– Memory access patterns
– Branch divergence
10× speed-up
– GeForce GTX TITAN (20% usage)
– Unoptimized
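One concrete way these details show up, as an illustrative example rather than the actual code: with a row-major image, giving adjacent threads adjacent columns makes every load of a column-sum coalesced, and the early-exit bounds check only diverges in the last warp at the image border.

    // Column sums of a row-major size x size image: at every iteration the
    // threads of a warp read consecutive addresses, so loads are coalesced.
    __global__ void column_sums(const float* image, float* sums, int size)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= size)        // only the last warp can diverge here
            return;

        float sum = 0.0f;
        for (int row = 0; row < size; ++row)
            sum += image[row * size + col];   // stride-1 across the warp
        sums[col] = sum;
    }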
Performance for 10 signatures
[Chart: execution time in milliseconds for the MEX, C++, OpenMP and CUDA implementations]
Future work
Optimize CUDA
– Compare against state of the art
Julia implementation
– Algorithmic IR