1
Track finding and fitting with GPUs
Oleksiy Rybalchenko and Mohammad Al-Turany, GSI-IT
PANDA FEE/DAQ/Trigger Workshop, Grünberg, 4/29/11
2
Track finding and fitting for PANDA: central tracker (MVD + STT)
3
Work is already ongoing for offline and online in PANDA
4
Secondary vertex: re-iterate with each hit point as the reference point
5
Lines in conformal space can be found using the Hough transform
6
David Münchow and Soeren Lange kindly shared their C++ code with us.
7
Our group would like to:
- Port the code developed at Giessen to the GPU
- Compare the performance, scalability, reusability, etc. with other hardware and/or software techniques on the market
- Build a GPU-based computing-node prototype that can perform tracking on the fly (online) without I/O
8
Graphics processing units (GPUs): Nowadays, GPUs have evolved into high-performance co-processors. They can be easily programmed with common high-level languages such as C, C++, and Fortran. Today's GPUs greatly outpace CPUs in arithmetic performance and memory bandwidth, making them ideal co-processors for accelerating a variety of data-parallel applications.
9
Floating-Point Operations per Second for CPU and GPU
10
Memory Bandwidth for the CPU and GPU
11
Why NVIDIA?
- Supports a true cache hierarchy in combination with on-chip shared memory
- First GPU architecture to support ECC: it detects and corrects errors before the system is affected, and it protects register files, shared memories, L1 and L2 caches, and DRAM
- Limited, but increasing, support for C++
- The Tesla product family is designed from the ground up for parallel computing and offers exclusive computing features
12
Why CUDA?
- CUDA development tools work alongside the conventional C/C++ compiler, so one can mix GPU code with general-purpose code for the host CPU.
- CUDA automatically manages threads:
o It does NOT require explicit thread management in the conventional sense, which greatly simplifies the programming model.
- Stable, available for free, documented, and supported on Windows, Linux, and Mac OS
- Low learning curve:
o Just a few extensions to C
o No knowledge of graphics is required
13
CUDA features:
- Standard C/C++ language for parallel application development on the GPU
- Standard numerical libraries:
o FFT (Fast Fourier Transform)
o BLAS (Basic Linear Algebra Subroutines)
o CUSPARSE (for handling sparse matrices)
o CURAND (generation of high-quality pseudorandom and quasi-random numbers)
- Dedicated CUDA driver for computing, with a fast data-transfer path between GPU and CPU
14
Automatic Scalability
15
CUDA Toolkit:
- C/C++ compiler
- Visual Profiler
- cuda-gdb debugger
- GPU-accelerated BLAS library
- GPU-accelerated FFT library
- GPU-accelerated sparse-matrix library
- GPU-accelerated RNG library
- Additional tools and documentation
16
GPU usage examples in FairRoot
17
GPU usage in FairRoot:
- FindCuda.cmake (Abe Stephens, SCI Institute) integrates CUDA into FairRoot very smoothly
- CMake creates shared libraries for the CUDA part
- FairCuda is a class that wraps the CUDA-implemented functions so that they can be used directly from ROOT CINT or from compiled code
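A minimal sketch of how such a CUDA shared library can be declared via FindCuda.cmake; the target and source names below are illustrative, not the actual FairRoot build files:

```cmake
# Locate the CUDA toolkit (FindCuda.cmake / FindCUDA module).
find_package(CUDA REQUIRED)

# nvcc flags; the compute architecture here is an assumption
# for a Fermi-class Tesla card.
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -O2 -arch=sm_20)

# Build the wrapper class and its kernels into one shared library,
# so ROOT/CINT-loadable code can link against it like any other lib.
cuda_add_library(FairCuda SHARED FairCuda.cu)
target_link_libraries(FairCuda ${CUDA_LIBRARIES})
```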
18
Track propagation (RK4) in a magnetic field on the GPU: speedup by a factor of up to ~40.

Propagations/event | Speedup, Tesla (CPU/GPU)
10 | 11
50 | 15
100 | 15
200 | 24
500 | 34
700 | 41
19
Track propagation (RK4) using the PANDA field: speedup up to a factor of 175.
20
Track + vertex fitting (PANDA): the same program code on the same hardware, just using pinned memory instead of mem-copy. (Figure: CPU time / GPU time, split into data copy and kernel execution.)
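The difference comes down to how the host buffer is allocated. An illustrative sketch (not the PANDA code): page-locked (pinned) memory lets the GPU's DMA engine read host RAM directly, so transfers run faster and can overlap with kernel execution; requires a CUDA-capable system to run.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

void transferPageable(float* dDst, size_t n) {
    // Pageable host memory: the driver must stage it through an internal
    // pinned buffer before the DMA engine can touch it.
    float* h = static_cast<float*>(std::malloc(n * sizeof(float)));
    cudaMemcpy(dDst, h, n * sizeof(float), cudaMemcpyHostToDevice);
    std::free(h);
}

void transferPinned(float* dDst, size_t n) {
    // Page-locked host memory: the DMA engine reads it directly.
    float* h = nullptr;
    cudaMallocHost(&h, n * sizeof(float));
    cudaMemcpy(dDst, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaFreeHost(h);
}
```

Allocating with `cudaHostAlloc(..., cudaHostAllocMapped)` goes one step further: the device can read the host buffer in place (zero copy), which is the mechanism the online prototype later in this talk relies on.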
21
Porting the track finder/fitter to CUDA
22
First issues:
- The original code is not optimized for parallel architectures
- Lookup tables are used for the mathematical functions (the code is designed to work on an FPGA)
- Redesign the code into many functions (kernels) instead of one main
- Use the standard mathematical libraries delivered by NVIDIA
23
Redesigning the code into kernels: the code became more modular, and different kernels (functions) have different parallelization levels.
24
A total improvement of a factor of 200 compared to the original code on an Intel Xeon CPU W3505 @ 2.53 GHz:

Kernel | CPU (ms) | GPU (ms) | Improvement | Occupancy | Notes
total runtime (without Z-Analysis) | 117138 | 590 | 199 | |
startUp() | 0.25 | 0.0122 | 20 | 2% | runs (num_points) times
setOrigin() | 0.25 | 0.0119 | 21 | 25% | runs (num_points) times
clear Hough and Peaks (memset on GPU) | 3 | 0.0463 | 65 | 100% | runs (num_points) times
conformalAndHough() | 73 | 0.8363 | 87 | 25% | runs (num_points) times
findPeaksInHoughSpace() | 51 | 0.497 | 103 | 100% | runs (num_points) times
findDoublePointPeaksInHoughSpace() | 4 | 0.0645 | 62 | 100% | runs (num_points) times
collectPeaks() | 4 | 0.066 | 61 | 100% | runs (num_points) times
sortPeaks() | 0.25 | 0.0368 | 7 | 2% | runs (num_points) times
resetOrigin() | 0.25 | 0.0121 | 21 | 25% | runs (num_points) times
countPointsCloseToTrackAndTrackParams() | 22444 | 0.9581 | 23426 | 33% | runs once
collectSimilarTracks() | 4 | 2.3506 | 2 | 67% | runs once
collectSimilarTracks2() | | | | 2% | runs once
getPointsOnTrack() | 0.25 | 0.0187 | 13 | 33% | runs (num_tracks) times
nullifyPointsOfThisTrack() | 0.25 | 0.0106 | 24 | 33% | runs (num_tracks) times
clear Hough space (memset on GPU) | 2 | 0.0024 | 833 | 100% | runs (num_tracks) times
secondHough() | 0.25 | 0.0734 | 3 | 4% | runs (num_tracks) times
findPeaksInHoughSpaceAgain() | 290 | 0.2373 | 1222 | 66% | runs (num_tracks) times
collectTracks() | 0.25 | 0.0368 | 7 | 2% | runs (num_tracks) times
25
Profiler output for GPU time for each kernel
26
Up to 130 GB/s of global-memory write throughput is reached with the current implementation:

Kernel | GPU time % | Global memory write throughput (GB/s) | Global memory read throughput (GB/s) | Occupancy % | Total threads launched
conformalAndHough | 54 | 1.34 | 1.12 | 25 | 42 880
findPeaksInHoughSpace | 32 | 13.24 | 5.20 | 100 | 14 353 408
collectPeaks | 4 | 0.01 | 45.65 | 100 | 14 424 064
findDoublePointPeaksInHoughSpace | 4 | 0.00 | 45.68 | 100 | 14 350 336
memset32_aligned1D | 3 | 130.46 | 0.08 | 100 | 28 753 728
27
The results are comparable:

CPU results:
- found tracks in first step: 1615
- collect similar tracks: done, number of tracks: 158
- recalculate tracks: done, number of tracks: 14

GPU results:
- found tracks in first step: 1609
- collect similar tracks: done, number of tracks: 151
- recalculate tracks: done, number of tracks: 14
28
Planned prototype for online tracking: the data buffer is allocated by the GPU (pinned/mapped memory)
29
Plans for the online system: online track finding and fitting with the GPU. In collaboration with the GSI EE department, build a prototype for an online system:
o Use the PEXOR card to get data to the PC
o The PEXOR driver allocates a buffer in PC memory and writes the data to it
o The GPU uses zero-copy to access the data, analyzes it, and writes the results
30
PEXOR: The GSI PEXOR is a PCI Express card that provides a complete development platform for designing and verifying applications based on the Lattice SCM FPGA family. Serial gigabit transceiver interfaces (SERDES) provide connections to PCI Express x1 or x4 and to four 2 Gbit SFP optical transceivers.
31
Summary and outlook:
- Porting the code is ongoing; for the ported part we already gain a factor of 200 compared to single-threaded CPU code.
- We need to use the fast Hough transform reported by David and Soeren.
- Code optimization should follow.
- For the GPU-PEXOR computing-node prototype we need:
o To re-write the PEXOR driver to accept an externally allocated data buffer
o To use one PEXOR as a transmitter to send Geant events (and/or time streams) for testing and debugging