Parallelisation of Random Number Generation in PLACET
Approaches to parallelisation in PLACET
Martin Blaha
University of Vienna AT CERN
Additions to centralised RNG
● Tcl command RandomReset
o sets seeds for all streams individually: RandomReset -stream Misalignments -seed 1234
o sets default seeds (= reset) if called without argument
o sets generators
o removes redundancy among Tcl commands that set seeds (e.g. Groundmotion_init)
o help lists all streams
● Benchmarks on non-parallelised code
o gsl causes a slowdown of max. 3%, depending on generator
Motivation for parallel execution
Runtimes of simulations are slow!
“Low-performance” functions:
● SBEND
● QUADRUPOLE
● MULTIPOLE
● ELEMENT
→ they refer to RNGs through synchrotron radiation emission
Profile by Yngve Levinsen, Feb
Parallel Random Number Generation
Problems:
● requesting random numbers from a sequential stream for parallel use is uncontrollable
● wanted: controllable and reproducible generation
● gsl random number generators do not support parallel generation by themselves
Methods for parallel random number generation
● centralized generation
● replicated generation
● distributed generation
● existing libraries
Centralized RNG
One generator produces all numbers
Advantages:
● only one RNG with good sequence
● easy implementation
Disadvantages:
● race conditions occur
● fair play not guaranteed, or crash (programme not stable)
● slow if queueing (even slower than single thread)
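The "fair play not guaranteed" point can be illustrated without real threads. The sketch below is an assumption-laden toy, not PLACET code: `serve_centralised` and the `schedule` argument are hypothetical names, and `std::mt19937_64` stands in for whichever generator is actually used. It hands out numbers from one shared generator in a given call order, so the numbers a thread receives depend on when the other threads happen to ask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hand out numbers from ONE shared generator, following a given call order.
// schedule[k] is the id of the (simulated) thread making the k-th request.
std::vector<std::vector<std::uint64_t>> serve_centralised(
    std::uint64_t seed, std::size_t n_threads,
    const std::vector<std::size_t>& schedule) {
    std::mt19937_64 shared_engine(seed);  // the single central generator
    std::vector<std::vector<std::uint64_t>> received(n_threads);
    for (std::size_t id : schedule) {
        assert(id < n_threads);
        received[id].push_back(shared_engine());  // next number goes to whoever asks
    }
    return received;
}
```

With the call orders `{0, 1, 0, 1}` and `{0, 0, 1, 1}`, thread 0 receives elements 0 and 2 of the sequence in the first case, but elements 0 and 1 in the second. In a real multi-threaded run the call order is decided by the scheduler, which is why results are not reproducible.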
Replicated RNG
Initial RNG is copied for each thread
Advantages:
● more efficient
● easy implementation
Disadvantage:
● can suffer from correlations between threads
Distributed RNG
Each thread has its own generator
Advantages:
● efficient: each thread can work standalone
● thread-safe
● reproducible
Disadvantage:
● can suffer from correlations
Existing Libraries
● SPRNG - Florida State University
o hard to find “good” documentation on how to combine it with parallel code, e.g. OpenMP
● PRAND
o for CUDA environment on GPU and CPU
o good documentation on RNGs in general
Disadvantage: yet another library
Distributed RNG
Summary: distributed generation is considered the best fit for our needs
Common methods that are known to produce satisfactory results:
1. Random Tree Method
2. Block Splitting
3. Leapfrog Method
Random Tree Method
● Global RNG for seeding
● Standalone RNG per thread
● Reproducible for a known number of threads
o new Tcl command to set number of threads
→ only runs fair for the same number of threads, not for dynamic thread assignment
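A minimal sketch of the random tree idea, assuming `std::mt19937_64` as a stand-in generator (the slides do not show the actual implementation; `make_tree_streams` is a hypothetical name): a global generator is used only to produce seeds, and each thread then owns a standalone generator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Build one standalone engine per thread. The seeds are drawn from a
// global "seeding" generator, so the whole tree is fixed by global_seed.
std::vector<std::mt19937_64> make_tree_streams(std::uint64_t global_seed,
                                               std::size_t n_threads) {
    assert(n_threads > 0);
    std::mt19937_64 seeder(global_seed);  // global RNG, used for seeding only
    std::vector<std::mt19937_64> streams;
    streams.reserve(n_threads);
    for (std::size_t t = 0; t < n_threads; ++t) {
        streams.emplace_back(seeder());   // thread t is seeded with the t-th output
    }
    return streams;
}
```

Calling `make_tree_streams(1234, 4)` twice yields identical per-thread streams. The work assigned to each thread still changes with the thread count, which is why reproducibility holds only for a fixed number of threads.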
Block Splitting
Split a sequence of RNs into blocks
Advantages:
● no overlap in random numbers
● plays fair
Disadvantages:
● allocates a huge array of numbers
● number of RNs has to be known in advance
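Block splitting as described above can be sketched as follows. This is an illustration under assumptions, not the PLACET code: `block_split` is a hypothetical helper and `std::mt19937_64` stands in for the real generator. The full sequence is generated up front, which is where the large allocation and the need to know the count in advance come from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Pre-generate n_threads * block_size numbers from one generator and give
// thread t the contiguous block [t * block_size, (t + 1) * block_size).
std::vector<std::vector<double>> block_split(std::uint64_t seed,
                                             std::size_t n_threads,
                                             std::size_t block_size) {
    assert(n_threads > 0);
    std::mt19937_64 engine(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    // The whole sequence must exist before it can be split.
    std::vector<double> sequence(n_threads * block_size);
    for (double& x : sequence) x = uniform(engine);

    std::vector<std::vector<double>> blocks(n_threads);
    for (std::size_t t = 0; t < n_threads; ++t) {
        const auto first = sequence.begin() + static_cast<std::ptrdiff_t>(t * block_size);
        blocks[t].assign(first, first + static_cast<std::ptrdiff_t>(block_size));
    }
    return blocks;
}
```

Concatenating the blocks in thread order gives back the original sequential stream, so no number is used twice.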
Leapfrog Method
Distributes a sequence of RNs over several threads one by one
Advantages:
● number of RNs need not be known in advance
● guarantees no overlap of RNs
● plays fair, still permutations in calls
Disadvantage:
● costly call of random numbers
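A minimal leapfrog sketch, assuming each thread keeps its own copy of the same generator and skips the elements owned by the other threads (`LeapfrogStream` is a hypothetical class; `std::mt19937_64` is a stand-in generator). The skipping in `next()` is what makes each call costly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Thread `thread_id` of `n_threads` receives elements
// thread_id, thread_id + n_threads, thread_id + 2 * n_threads, ...
class LeapfrogStream {
public:
    LeapfrogStream(std::uint64_t seed, std::size_t thread_id, std::size_t n_threads)
        : engine_(seed), stride_(n_threads) {
        assert(n_threads > 0 && thread_id < n_threads);
        engine_.discard(thread_id);    // move to this thread's first element
    }

    std::uint64_t next() {
        const std::uint64_t value = engine_();
        engine_.discard(stride_ - 1);  // jump over the other threads' elements
        return value;
    }

private:
    std::mt19937_64 engine_;
    std::size_t stride_;
};
```

Reading the streams round-robin (thread 0, 1, 2, 0, 1, ...) reproduces the sequential output of a single generator with the same seed. Every call advances the underlying generator by `n_threads` elements while only one is used.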
Block splitting vs. Leapfrog
● Block splitting and leapfrog run fair with dynamic thread assignment
● Problem: implementation in a distributed, non-centralised way
● Period per thread is (period of RNG) / (# threads)
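As a worked instance of the period statement, take an RNG of period P shared by T threads. The Mersenne Twister period used below is only an illustrative assumption; the slides do not say which generator is meant here.

```latex
P_{\text{thread}} = \frac{P}{T}, \qquad
\text{e.g. } P = 2^{19937} - 1,\; T = 32 = 2^{5}
\;\Rightarrow\; P_{\text{thread}} \approx 2^{19932}
```

For a generator with such a long period the reduction is harmless; for short-period generators it may matter.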
Testing parallel RNG methods
● SPEEDUP: up to 33.3% shorter runtime for random tree method
o only overheads for nosynrad and a small number of particles
● SLOWDOWN: up to 120% longer runtime for leapfrog method
o due to withdrawing more numbers than needed
Testing via test-bds-track for particles, with quadrupoles and multipoles
Preparation
Tool for parallelisation: OpenMP
● easy implementation
● control of variable scope, assignment schedule, critical sections
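The three OpenMP controls named above can be sketched in one loop. This is a toy example under stated assumptions, not PLACET code: `count_emitting` is a hypothetical function, and seeding one generator per particle is chosen here only so that the result does not depend on the thread count. Compile with OpenMP enabled (e.g. `-fopenmp` for GCC); without it the pragmas are ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Count the "particles" whose uniform draw falls below prob.
long count_emitting(std::size_t n_particles, double prob, unsigned base_seed) {
    assert(prob >= 0.0 && prob <= 1.0);
    long total = 0;                              // variable scope: shared
    const long n = static_cast<long>(n_particles);

    #pragma omp parallel shared(total)
    {
        long local = 0;                          // variable scope: private per thread

        #pragma omp for schedule(static)         // assignment schedule
        for (long i = 0; i < n; ++i) {
            // Assumption for this sketch: one generator per particle index.
            std::mt19937 engine(base_seed + static_cast<unsigned>(i));
            std::uniform_real_distribution<double> uniform(0.0, 1.0);
            if (uniform(engine) < prob) ++local;
        }

        #pragma omp critical                     // critical section: one thread at a time
        total += local;
    }
    return total;
}
```

Accumulating into a private counter and entering the critical section once per thread keeps the serialised part short.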
Preparation: Centralising synrad functions
● 2 functions calculate synrad emission, in synrad.cc and photon_spectrum.cc
● Centralised for easier and reproducible use of parallel RNG
● synrad.cc has been removed
● Tested via test-bds-track for 3e5 particles, same outcome
Implementation of new class
● New class PARALLEL_RNG
o inherits all methods from RANDOM_NEW
o always initialises the parallel RNG on the max. number of available threads
● New Tcl command ParallelThreads -num val to choose number of threads
● RNG stream Radiation now runs completely parallel by default
Testing – BDS tracking
[Figure: Covariance matrix of test-bds-track]
Testing via test-bds-track for particles, with quadrupoles and multipoles
Testing – CLIC beam tracking
[Figure panels: Beam tracking with no correction; Beam tracking with simple correction]
Testing test-clic-3 for 3500 machines
Time Profile
● Total runtime on 32 cores: 27 sec
● Total runtime on 1 core: 1 m 21 sec
● Total runtime on PLACET: 58 sec
● BDS tracking: PLACET: 39 sec, PLACET-NEW: ~9 sec
● BDS TRACKING 3.5 times faster
Time profile for BDS tracking for particles
2x Intel Xeon E GHz 8-Core (16 w/ hyper-threading) (95W 20MB 2.8GHz Turbo Sandy Bridge EP)
Profiling
● BDS: Sbend, elements, multipole, quadrupole are still the most time-consuming functions
● Linac: OMP library causes slowdown in simple-correction routines (e.g. test-clic-4)
o 76% of time consumption caused by OpenMP in wait_sleep
It was necessary to find a compromise!
Conclusion
● BDS runs ~30 % faster (total runtime)
● CLIC 4 runs ~13 % faster
Compared to current PLACET in the trunk
OpenMP is a quick and easy way to parallelise existing functions.
Future Plan
● Need to understand the overhead while running sequentially
● Benchmark performance of quick functions, e.g. dipoles, drifts, step-in, BPMs
● Adjust automatically to current configuration
● Write technical/user documentation
● Merge into trunk