Validation October 13, 2017
Validation: GPU vs SNB (Broadwell)
http://mlefebvr.web.cern.ch/mlefebvr/20171010_gpu_val_test/
https://github.com/mpbl/mictest/tree/gpu-validation
The three other plots lie on the abscissa (their ordinate is zero everywhere).
Bottleneck Hunting
Single Event Profiling
Using several streams to schedule event computations concurrently on a GPU helps to fill the GPU and hide data transfers.
Can we get a better understanding of all the bottlenecks in the GPU code, and can we order them from the most limiting to the least limiting?
If we were able to significantly improve the performance of computing a single event, we might also improve the time-to-solution of multi-event reconstruction.
Note: We'll be using CUDA 9 and the associated nvvp. While still not on par with VTune, this version provides new features that make it easier to see what's going on.
Let's play with the clone engine; a priori, it is the one taking the most serious performance hit.
Occupancy: CE Seed-based
Achieved occupancy is half of the theoretical occupancy, and we are limited by registers.
With a GPlex size of 10,000 and a block size of 256, we have 40 blocks. Each block is scheduled on one of the P100's 56 SMs, so at most 40 SMs receive work. This explains the difference between achieved and theoretical occupancy.
That is most likely mitigated by running concurrent events (additional events get scheduled on the free SMs).
Note that it does not make sense to use a larger GPlex size: the seed-based parallelization has a concurrency factor of at most the number of seeds.
Occupancy: CE Track-based parallelization
With a GPlex size of 10,000 (thus equal to the number of seeds), we are in a very similar situation: imbalance between SMs because the number of blocks is too small.
Occupancy: CE Track-based parallelization
With the track-based parallelization, we can raise the number of threads up to the maximum number of track candidates. Using a GPlex size of 30,000:
Real Issues: Memory Dependencies
What does memory dependency mean:
  - A load / store cannot be done.
  - Global memory contains the arrays stored in global memory AND spilled registers (said to be in local memory, which is actually a part of global memory).
Why:
  - Resource not available
  - Resource fully utilized
  - Too many requests are outstanding
Where does it come from (first suspects):
  - Reorganizing to GPlex (aka SlurpInIdx_fn).
  - Updating candidate arrays with new best candidates.
  - Some dependencies on GPlex stores. Loads are fine.
  - MatrixRepresentationStatic.h (Tracks) => not coalesced, not aligned
What should we start with:
  - The profiler does not show hotspots, nor the time spent per function, per line of code, ...
Updating the Candidate List with the New Best Candidates
Compiled with --keep
Register spilling in local memory.
  - Initialize rd1 to 0 … in local memory:
      Track tmp_track;
  - Load the track from global memory and store it right back to local memory:
      Track& base_track = candidates[…]
      tmp_track = base_track