Track finding and fitting with GPUs
Oleksiy Rybalchenko and Mohammad Al-Turany, GSI-IT
4/29/11, PANDA FEE/DAQ/Trigger Workshop, Grünberg
Track finding and fitting for PANDA: central tracker (MVD + STT)
Work is already ongoing for offline and online in PANDA
Secondary vertices: re-iterate with each hit point as the reference point
Lines in the conformal space can be found using the Hough Transform
David Münchow and Soeren Lange kindly shared their C++ code with us.
Our group: We would like to
o Port the code developed at Giessen to the GPU
o Compare the performance, scalability, re-usability, etc. with other hardware and/or software techniques on the market
o Build a computing-node prototype based on GPU that can do tracking on the fly (online) without I/O
The graphics processing units: GPUs
Nowadays, GPUs have evolved into high-performance co-processors. They can be easily programmed with common high-level languages such as C, C++, and Fortran. Today's GPUs greatly outpace CPUs in arithmetic performance and memory bandwidth, making them ideal co-processors for accelerating a variety of data-parallel applications.
Floating-Point Operations per Second for CPU and GPU
Memory Bandwidth for the CPU and GPU
Why NVIDIA?
A true cache hierarchy in combination with on-chip shared memory
First GPU architecture to support ECC: it detects and corrects errors before the system is affected, and protects register files, shared memories, L1 and L2 caches, and DRAM
Limited, but increasing, support for C++
The Tesla product family is designed from the ground up for parallel computing and offers exclusive computing features
Why CUDA?
CUDA development tools work alongside the conventional C/C++ compiler, so one can mix GPU code with general-purpose code for the host CPU.
CUDA manages threads automatically:
o It does NOT require explicit thread management in the conventional sense, which greatly simplifies the programming model.
Stable, freely available, documented, and supported on Windows, Linux, and Mac OS
Low learning curve:
o Just a few extensions to C
o No knowledge of graphics is required
CUDA features:
Standard C/C++ language for parallel application development on the GPU
Standard numerical libraries:
o FFT (Fast Fourier Transform)
o BLAS (Basic Linear Algebra Subroutines)
o CUSPARSE (handling of sparse matrices)
o CURAND (generation of high-quality pseudorandom and quasirandom numbers)
Dedicated CUDA driver for computing, with a fast data-transfer path between GPU and CPU
Automatic Scalability
CUDA Toolkit:
C/C++ compiler
Visual Profiler
cuda-gdb debugger
GPU-accelerated BLAS library
GPU-accelerated FFT library
GPU-accelerated Sparse Matrix library
GPU-accelerated RNG library
Additional tools and documentation
GPU usage examples in FairRoot
GPU usage in FairRoot
FindCuda.cmake (Abe Stephens, SCI Institute) integrates CUDA into FairRoot very smoothly
CMake creates shared libraries for the CUDA part
FairCuda is a class that wraps the CUDA-implemented functions so that they can be used directly from ROOT CINT or from compiled code
Track propagation (RK4) in a magnetic field on the GPU
[Figure: propagation time per event, CPU vs. GPU (Tesla)]
Speedup by a factor of 40
Track propagation (RK4) using the PANDA field
Speedup: up to a factor of 175
Track + vertex fitting (PANDA)
[Figure: CPU time / GPU time, with separate copy-data and execute phases]
The same program code, same hardware, just using pinned memory instead of mem-copy!
Porting the track finder/fitter to CUDA
First issues:
The original code is not optimized for parallel architectures
Lookup tables are used for the mathematical functions (the code is designed to work on an FPGA)
Redesign the code into many functions (kernels) instead of one main
Use the standard mathematical libraries delivered by NVIDIA
Redesigning the code into kernels
The code became more modular. Different kernels (functions) have different parallelization levels.
A total improvement of a factor of 200 compared to the original code on an Intel Xeon 2.53 GHz CPU
[Table: CPU (ms), GPU (ms), improvement, and occupancy per kernel]
total runtime (without Z-analysis)
startUp(): runs (num_points) times
setOrigin(): runs (num_points) times
clear Hough and peaks (memset on GPU): runs (num_points) times
conformalAndHough(): runs (num_points) times
findPeaksInHoughSpace(): runs (num_points) times
findDoublePointPeaksInHoughSpace(): runs (num_points) times
collectPeaks(): runs (num_points) times
sortPeaks(): runs (num_points) times
resetOrigin(): runs (num_points) times
countPointsCloseToTrackAndTrackParams(): runs once
collectSimilarTracks(): runs once
collectSimilarTracks2(): runs once
getPointsOnTrack(): runs (num_tracks) times
nullifyPointsOfThisTrack(): runs (num_tracks) times
clear Hough space (memset on GPU): runs (num_tracks) times
secondHough(): runs (num_tracks) times
findPeaksInHoughSpaceAgain(): runs (num_tracks) times
collectTracks(): runs (num_tracks) times
Profiler output for GPU time for each kernel
Up to 130 GB/s of global memory write throughput is reached with the current implementation
[Table: GPU time %, global memory write and read throughput (GB/s), occupancy %, and total number of threads launched for the kernels conformalAndHough, findPeaksInHoughSpace, collectPeaks, findDoublePointPeaksInHoughSpace, and memset32_aligned1D]
The results are comparable:
CPU results: found tracks in first step: 1615; collect similar tracks: done, number of tracks: 158; recalculate tracks: done, number of tracks: 14
GPU results: found tracks in first step: 1609; collect similar tracks: done, number of tracks: 151; recalculate tracks: done, number of tracks: 14
Planned prototype for online tracking: the data buffer is allocated by the GPU (pinned/mapped memory)
Plans for the online system:
Online track finding and fitting with the GPU
In collaboration with the GSI EE department, build a prototype for an online system:
o Use the PEXOR card to get data to the PC
o The PEXOR driver allocates a buffer in PC memory and writes the data to it
o The GPU uses zero copy to access the data, analyzes it, and writes the results
PEXOR
The GSI PEXOR is a PCI Express card that provides a complete development platform for designing and verifying applications based on the Lattice SCM FPGA family. Serial gigabit transceiver interfaces (SERDES) provide connections to PCI Express x1 or x4 and to four 2 Gbit SFP optical transceivers.
Summary and outlook
Porting the code is ongoing; for the ported part we already gain a factor of 200 compared to single-threaded CPU code.
We need to use the fast Hough transform reported by David and Soeren. Code optimization should follow.
For the computing-node prototype based on GPU + PEXOR we need:
o To re-write the PEXOR driver to accept an externally allocated data buffer
o To use one PEXOR as a transmitter to send Geant events (and/or time streams) for testing and debugging