Code optimization in BLonD
D. Quartullo, 30/10/2014, HEADTAIL working group meeting
Acknowledgements: T. Argyropoulos, B. Hegner, A. Lasheen, J. E. Muller, D. Piparo, E. Shaposhnikova, H. Timko
Outlook
- Why optimize BLonD?
- Language used and tools for optimization
  - Python
  - Faster with C: Ctypes
  - Spyder profiler
  - GCC compiler and flags
- BLonD structure and optimization strategies
  - Main file, packages, modules and setup files
  - What to optimize: RAM memory and computation time
- Asymptotic study and routines optimization
  - Definitions and parameters of interest
  - Histogram constant space
  - Tracker
- Two realistic cases
  - LHC ramp with feedback and no impedances
  - SPS with full impedance model at injection
  - Observations
- Summary and next steps
- Experimental: parallelization with OpenMP
Why optimize BLonD?
Python

The language used is Python, mainly because it is:
- fast to program with;
- open source;
- widespread in the scientific computing community, so a lot of support is available;
- full of packages and routines, so its users have little to envy users of Matlab or Mathematica.

We are using the Anaconda distribution, which includes Python v2.7 64-bit and all the packages we need. Anaconda installs easily on Linux, Windows and Mac: https://store.continuum.io/cshop/anaconda/

But since Python code, like its cousins, is interpreted, it typically cannot reach the same performance as compiled code such as C/C++ or Fortran. We need something to boost Python...
Faster with C: Ctypes

Plenty of plugins allow one to 'connect' C/C++ and Python code; the main ones are the Python C API, Ctypes, SWIG and Cython. We started using Ctypes because:
- It allows coding the routines one wants to optimise in pure C/C++, with just an unavoidable overhead due to the type casting from Python to C. For our applications this overhead is in general small, since the number of floating-point operations BLonD has to carry out is large.
- The programmer has full control of the process of embedding C in Python (flags, parallelization in C, autovectorisation).
- The programmer does not have to learn a new language, as with Cython — assuming he knows C!
- If a certain routine is optimised in C and then linked with Python, the programmer can be sure that the best possible has been done.
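A minimal sketch of the Ctypes workflow described above. To stay self-contained it loads the C math library instead of a custom BLonD routine; the compile command in the comment and the `libkicks.so` name are illustrative, not files from the deck.

```python
import ctypes
import ctypes.util
import math

# Minimal Ctypes illustration (not BLonD code): we load the C math
# library instead of a custom routine, so nothing has to be compiled.
# A BLonD-style C++ routine would first be built as a shared library,
# e.g.  g++ -O3 -shared -fPIC -std=c++11 kicks.cpp -o libkicks.so
# and then loaded the same way with ctypes.CDLL("./libkicks.so").
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)

# The type-casting overhead mentioned above: argument and return
# types must be declared so Python floats become C doubles and back.
libm.sin.argtypes = [ctypes.c_double]
libm.sin.restype = ctypes.c_double

# The C function and its pure-Python counterpart agree.
print(libm.sin(0.5), math.sin(0.5))
```

The per-call casting cost is why Ctypes pays off for routines that do a lot of floating-point work per call, as the slide notes.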
Spyder profiler

Spyder is an IDE for Python. It includes a profiler able to measure and display, with a nice interface, the time spent in the various routines used in a Python run. Spyder ships, for example, with the Anaconda distribution. The Spyder profiler uses Python's cProfile module. A good web page with useful hints on profiling techniques and performance analysis: http://www.huyng.com/posts/python-performance-analysis/
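The same measurement Spyder shows can be reproduced directly with cProfile; `bunch_statistics` below is a toy stand-in for a real BLonD routine, not a function from the code.

```python
import cProfile
import io
import pstats

def bunch_statistics(n):
    # Toy stand-in for a BLonD routine: some floating-point work
    # whose cost we want to see in the profile.
    return sum(i * 0.5 for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
bunch_statistics(100000)
profiler.disable()

# The same data Spyder displays: time spent per function,
# sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```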
GCC compiler and flags

We need a C++ compiler to compile the files that are linked to Python via the Ctypes module. GCC is open source, complete and easy to use. We are using GCC v4.8.1 64-bit.
- Windows: http://tdm-gcc.tdragon.net/
- Linux (CERN): source /afs/cern.ch/sw/lcg/contrib/gcc/4.8.1/x86_64-slc6/setup.sh

One can compile a .cpp file with options called 'flags', used mostly to optimize the code. For example:
- '-O' flags: autovectorisation, fast mathematical operations, ... — think of it as parallelization with just one core!
- '-mfma4': fused multiply-add operations for even faster calculations — it speeds up operations of the type a + b × c (not all machines support this feature, but LXPLUS does!)
- '-std=c++11': the latest standard for the C++ language
- '-fopenmp': task parallelization with OpenMP
Packages, modules and setup files

- Code documentation
- Test cases
- Create a new beam and slice it
- C++ optimised routines
- Cython routines (just for test)
- Impedances in time and frequency domain
- Define beam and RF parameters
- Feedbacks, RF noise
- Statistics
- Plots
- Trackers
- README file
- Setup file for the C++ optimised routines
- Setup file for the Cython optimised routines
What to optimize: RAM memory and computation time

In BLonD we store the momentum program, beam energy, beam coordinates and similar quantities in arrays. In this way we save computation time on average, since we calculate once and for all, at the beginning, everything that is likely to be needed at least twice in the rest of the code. This technique obviously increases the RAM used significantly. On the other hand, RAM consumption is currently not a problem, even though we use the double-precision type, which requires 8 bytes for every number stored. E.g. our office PCs, as well as the LXPLUS machines, have at least 4 GB of RAM, and the simulation of the LHC ramp takes approximately 1.5 GB of memory. There could be problems if we want to launch two or more simulations on the same machine (e.g. locally), but on the LXPLUS batch service, for example, one can request at least 4 GB of RAM for each submitted job. All the effort has therefore been put into saving computation time.
What to optimize: RAM memory and computation time (continued)

[Diagram: trade-offs between the strategies — high computation time with negligible RAM usage, negligible computation time with high RAM usage, or zero computation time with negligible RAM usage.]
Definitions and parameters of interest

[Figure: asymptotic notation — f and g grow at the same pace; f grows slower than g, i.e. f = o(g).]
Definitions and parameters of interest (continued)

This implies that C[quick_sort] = o(C[insertion_sort]): quick sort performs better than insertion sort, at least for large n.

Main parameters of interest for the optimization study of BLonD:
- number of macroparticles M
- number of slices S

We will suppose that these two variables are independent of each other; in other words we can have M → ∞ and/or S → ∞.
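The sorting comparison can be made concrete by counting element comparisons (a sketch not taken from the slides; merge sort stands in for quick sort as the n·log n algorithm, because its comparison count is easier to instrument).

```python
def insertion_sort_comparisons(a):
    # Returns the number of element comparisons on input a;
    # grows like n^2 on reversed input (the worst case).
    a, count = list(a), 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            count += 1
            if a[j - 1] > a[j]:
                a[j - 1], a[j] = a[j], a[j - 1]
                j -= 1
            else:
                break
    return count

def merge_sort_comparisons(a):
    # Returns (number of comparisons, sorted list); grows like n*log(n).
    if len(a) <= 1:
        return 0, list(a)
    mid = len(a) // 2
    c1, left = merge_sort_comparisons(a[:mid])
    c2, right = merge_sort_comparisons(a[mid:])
    merged, count, i, j = [], c1 + c2, 0, 0
    while i < len(left) and j < len(right):
        count += 1
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged += left[i:] + right[j:]
    return count, merged

worst = list(range(200, 0, -1))           # reversed input: worst case
ins = insertion_sort_comparisons(worst)   # n(n-1)/2 = 19900 comparisons
mrg, _ = merge_sort_comparisons(worst)    # far fewer, ~ n log2 n
```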
Histogram constant space

Code not optimized: numpy.histogram is too expensive if the slicing is done with constant spacing. Why? Let's have a look with Spyder, which can easily show all the subroutines of a given routine. From this picture we still cannot say a lot, since we do not know how the various subroutines are used. Let's go deeper... A binary search algorithm is used.
Histogram constant space (continued)
Histogram constant space (continued) [figure annotation: × M]
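The trick behind the optimised histogram, sketched here with NumPy rather than the deck's C++ (function and variable names are illustrative, not BLonD's): with constant slice spacing the bin index is a closed-form expression, so the per-particle binary search of the generic numpy.histogram path can be skipped entirely.

```python
import numpy as np

def histogram_const_space(x, n_slices, cut_left, cut_right):
    # With uniform bins the index is a direct formula --
    # no per-particle binary search over the edge array.
    bin_width = (cut_right - cut_left) / n_slices
    idx = ((x - cut_left) / bin_width).astype(np.int64)
    # Put values on the upper edge into the last bin; how to treat
    # out-of-range particles is a design choice of the real routine.
    idx = np.clip(idx, 0, n_slices - 1)
    return np.bincount(idx, minlength=n_slices)

coords = np.array([0.05, 0.15, 0.15, 0.95])
counts = histogram_const_space(coords, 10, 0.0, 1.0)
print(counts)  # one particle in slice 0, two in slice 1, one in slice 9
```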
Tracker

It can be even 17× more expensive than math.sin in Python!
Tracker (continued)

The sin in math.h is not autovectorizable, for two reasons:
- It is not inline, so when it is called in an otherwise vectorizable for loop, the compiler does not vectorize.
- It does not use polynomials but large look-up tables, so it cannot be vectorized.

Solution: a Taylor series? Good, since we would deal with polynomials, which are easily vectorizable. But even better: Padé rational functions, which allow autovectorisation (they are ratios of polynomials) and need fewer terms than a Taylor expansion to reach a fixed accuracy. We use the fast_sin routine from the VDT CERN library (D. Piparo and others).

KICK_ACCELERATION EQUATION: beam_dE = beam_dE + acceleration_kick — in C++ this is a for loop that is immediately vectorised.
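The two tracker updates can be sketched with NumPy (a hedged illustration: the parameter names and values below are invented stand-ins, not BLonD's actual attributes; in the deck these loops live in C++, where the compiler vectorises them and VDT's fast_sin replaces the libm sin).

```python
import math
import numpy as np

# Illustrative parameter values -- invented for the sketch,
# not BLonD's actual machine/RF settings.
beam_dt = np.linspace(-1e-9, 1e-9, 5)   # particle time coordinates
beam_dE = np.zeros(5)                   # particle energy coordinates
voltage, omega_rf, phi_rf = 6e6, 2 * math.pi * 400.8e6, 0.0
acceleration_kick = 1.0e3

# KICK: one sin per particle -- the call that fast_sin
# speeds up in the vectorised C++ loop.
beam_dE += voltage * np.sin(omega_rf * beam_dt + phi_rf)

# KICK_ACCELERATION: beam_dE = beam_dE + acceleration_kick,
# a loop the compiler vectorises immediately in C++.
beam_dE += acceleration_kick
```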
Tracker (continued)
LHC ramp with feedback and no impedances

Parameters: M = 50000, S = 100, num_turns = 1000. Machine: my office PC. CODE NOT OPTIMIZED.
LHC ramp with feedback and no impedances — CODE OPTIMIZED

Results: histogram: from 3.477 s to 0.188 s; tracker: from 1.877 s to 0.747 s.
SPS with full impedance model at injection

Parameters: M = 5000000, S = 500, num_turns = 10. Machine: my office PC. CODE NOT OPTIMIZED.
SPS with full impedance model at injection — CODE OPTIMIZED

Results: histogram: from 3.480 s to 0.178 s; tracker: from 4.949 s to 1.010 s.
Observations
Observations (continued)

LHC numpy.histogram vs SPS numpy.histogram: ceil(5000000 / 65536) × 10 = ceil(76.29) × 10 = 770.
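The count above reflects numpy.histogram's internal chunking: it processes the input in 65536-element blocks (a detail of the NumPy of that era), so the number of inner calls for the SPS case works out as:

```python
import math

M = 5000000          # macroparticles (SPS case)
num_turns = 10
BLOCK = 65536        # numpy.histogram's internal block size

calls = math.ceil(M / BLOCK) * num_turns
print(calls)  # 77 blocks per turn, 770 inner calls in total
```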
Summary and next steps

BLonD has to be optimised if we want to carry out expensive simulations. The tools used for the various optimizations have been shown. An asymptotic study of the histogram routine has been done. The histogram and the tracker have been optimized with very good results; in addition, it has been shown that it is difficult to optimize these routines further, at least with serial code.

In the near future:
- Optimization of other time-consuming routines, for example FFT, convolution, interpolation, Hamiltonian, ..., with and without parallelization.
- Multibunch and parallelization have to be done at the same time:
  - task parallelization: define dependencies among classes
  - 'trivial' parallelization over cores, e.g. for the tracker
  - parallelization of the physics of intensity effects
  - time domain: where to truncate the convolution
Experimental: Parallelization with OpenMP

OpenMP is an interface that allows parallelizing tasks when multiple cores share a common memory, as happens on our office PCs or on LXPLUS. The user can easily choose the number of cores to use. We found that on LXPLUS it is often better to use 7 cores than 8, perhaps because one core, being the main one, is generally used more than the others (see next slide). In our case many tasks can be parallelized, for example the tracker, the histogram and the interpolation routines, since the particles are independent of each other. However, it is not trivial to parallelize with OpenMP; if the parallelization is not done efficiently, or the problem is so small that the cost of communication between processors overcomes the gain from the splitting, a serial code can even be faster (see the last benchmarking). cProfile, and therefore the Spyder profiler as well, has not been tested by its developers for profiling multithreaded code; on the other hand, it is difficult to find reliable time profilers for multicore routines.
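The OpenMP code itself is C++, but the pattern it uses (each core histograms its own chunk of particles, then the per-core histograms are summed) can be sketched in Python with a thread pool. This is only an analogue with illustrative names; with pure Python/NumPy code the GIL limits the real speed-up that OpenMP delivers in C.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_histogram(chunk, n_slices, lo, hi):
    # Each worker bins its own chunk of particles independently.
    width = (hi - lo) / n_slices
    idx = np.clip(((chunk - lo) / width).astype(np.int64), 0, n_slices - 1)
    return np.bincount(idx, minlength=n_slices)

def parallel_histogram(x, n_slices, lo, hi, n_workers=4):
    chunks = np.array_split(x, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(
            lambda c: chunk_histogram(c, n_slices, lo, hi), chunks))
    # The reduction step of the OpenMP version: sum per-worker results.
    return np.sum(partials, axis=0)

rng = np.random.default_rng(0)
particles = rng.uniform(0.0, 1.0, 500000)
hist = parallel_histogram(particles, 100, 0.0, 1.0)
```

Splitting the particle array does not change any bin assignment, so the parallel result is identical to the serial one; only the wall-clock time differs.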
Experimental: Parallelization with OpenMP

Comparison of the NumPy histogram against a parallelized version of the optimised histogram discussed earlier. Parameters: 500000 particles, 100 turns. Machine: LXPLUS.

         NUMPY   1 core  2 cores 3 cores 4 cores 5 cores 6 cores 7 cores 8 cores  max
Test 1   7.489   1.648   0.842   0.650   0.457   0.422   0.313   0.305   0.346   4.084
Test 2   7.514   1.636   0.821   0.566   0.472   0.350   0.302   0.291   0.336   0.462
Test 3   7.496   1.600   0.874   0.576   0.446   0.377   0.337   0.299   0.348   0.331
Test 4   7.519   1.609   0.827   0.645   0.508   0.363   0.305   0.299   0.326   0.312
Test 5   7.486   1.624   0.831   0.576   0.453   0.396   0.356   0.278   0.353   0.274
Experimental: Parallelization with OpenMP

Benchmarking of various histogram methods: 500000 particles, 100 turns. OPTIMISED is the optimised method discussed before in these slides; VECTORIZABLE derives from OPTIMISED and is autovectorizable, but has two for loops inside instead of one; PARALLEL is the same method as in the previous slide.

LOCAL:
         NUMPY   PARALLEL 2 CORES   VECTORIZABLE DOUBLE   OPTIMISED DOUBLE   VECTORIZABLE FLOAT   OPTIMISED FLOAT
Test 1   3.547   0.488              0.445                 0.204              0.283                0.183
Test 2   3.566   0.520              0.439                 0.207              0.285                0.185
Test 3   3.537   0.536              0.455                 0.206              0.295                0.185

LXPLUS (1 core, + 0.2 for casting):
         NUMPY   PARALLEL 7 CORES (best)   VECTORIZABLE DOUBLE   OPTIMISED DOUBLE   VECTORIZABLE FLOAT   OPTIMISED FLOAT
Test 1   7.924   0.267                     0.565                 0.263              0.390                0.203
Test 2   7.582   0.257                     0.558                 0.252              0.386                0.204
Test 3   7.807   0.291                     0.584                 0.265              0.376                0.221
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
— Donald Knuth, computer scientist, Professor Emeritus at Stanford University, called the "father of the analysis of algorithms".

THANK YOU!