SIMD experiments
Optimizing the code using SSE intrinsics
Summary
- Software speedup when using SIMD operations
- Problems with using SIMD operations (long loading/setting times, fixed data types)
- Possible solutions (new data structures)
- Comparison of speedup when using SIMD operations in different contexts
- Speedup using a custom library for code manageability
- Speedup when optimizing a double loop
- Suggestions on using SIMD instructions
Performance increase in vectorized code
RecoLocalTracker/SiStripRecHitConverter, SiStripRecHitMatcher::project
For the unit test, using PfmCodeAnalyser (UNHALTED_CORE_CYCLES):
- Without SSE intrinsics: 28297893
- With SSE intrinsics: 16407853
- Speedup: 1.72
Reasons
- Vectorized arithmetic operations increase performance
- Loading/setting data into the xmm registers takes a long time, because the data isn't laid out for use with SIMD instructions
Code
    GlobalPoint globalpointini = (det->surface()).toGlobal(strip.first);
    GlobalPoint globalpointend = (det->surface()).toGlobal(strip.second);

    // toGlobal, vectorized: ret = m0*x + m1*y + m2*z + p0
    __m128 ret1 = _mm_set1_ps(locp1.y());
    ret1 = _mm_mul_ps(m1, ret1);
    __m128 t2 = _mm_set1_ps(locp1.z());
    t2 = _mm_mul_ps(m2, t2);
    ret1 = _mm_add_ps(ret1, t2);
    t2 = _mm_set1_ps(locp1.x());
    t2 = _mm_mul_ps(m0, t2);
    ret1 = _mm_add_ps(t2, ret1);
    ret1 = _mm_add_ps(ret1, p0vec);

    __m128 ret2 = _mm_set1_ps(locp2.y());
    ret2 = _mm_mul_ps(m1, ret2);
    t2 = _mm_set1_ps(locp2.z());
    t2 = _mm_mul_ps(m2, t2);
    ret2 = _mm_add_ps(ret2, t2);
    t2 = _mm_set1_ps(locp2.x());
    t2 = _mm_mul_ps(m0, t2);
    ret2 = _mm_add_ps(t2, ret2);
    ret2 = _mm_add_ps(ret2, p0vec);
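The snippet above assumes m0, m1, m2 and p0vec have already been broadcast from the detector geometry. A minimal sketch of how they could be prepared once, outside any per-hit code: the rotation()/position() accessors follow the CMSSW Surface interface, but whether the rows or the columns of the rotation matrix go into m0..m2 depends on the convention toGlobal uses, so treat this as illustrative.

    // Illustrative setup (not the original code): broadcast the rotation
    // rows and the detector position into SSE registers once.
    // _mm_set_ps takes arguments from the highest lane down; lane 3 is unused.
    const Surface& s = det->surface();
    const __m128 m0 = _mm_set_ps(0.f, s.rotation().xz(), s.rotation().xy(), s.rotation().xx());
    const __m128 m1 = _mm_set_ps(0.f, s.rotation().yz(), s.rotation().yy(), s.rotation().yx());
    const __m128 m2 = _mm_set_ps(0.f, s.rotation().zz(), s.rotation().zy(), s.rotation().zx());
    const __m128 p0vec = _mm_set_ps(0.f, s.position().z(), s.position().y(), s.position().x());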
In the context of CMSSW
RecoLocalTracker/SiStripRecHitConverter, SiStripRecHitMatcher::match
For the CMSSW test, using CMSSW performance monitoring tools (UNHALTED_CORE_CYCLES, sampling period 1000):
- Without SSE intrinsics, match + project: 263153
- With SSE intrinsics, match: 204539
- Speedup: 1.3
Reasons
- Virtual methods are used, so without analyzing the code from a larger perspective, not much can be done
- Some of the code uses doubles, and with only a few operations to vectorize, the load time cancels out any speedup
- Some operations are not vectorizable
Code
    m.Invert();
    AlgebraicVector2 solution = m * c;

    __m128d mult = _mm_set1_pd(1. / (m00 * m11 - m01 * m10));
    __m128d resultmatmul = _mm_mul_pd(
        _mm_add_pd(_mm_mul_pd(minv10, c0vec),
                   _mm_mul_pd(minv00,
                              _mm_set1_pd(m11 * ((float *)&ret1)[1] + m10 * ((float *)&ret1)[0]))),
        mult);

    // set all the constant __m128 and __m128d type variables
    for (...) {
      // SIMD operations
    }
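The comment about setting the constant variables is the key point: every _mm_set* of a loop-invariant value should be hoisted out of the loop. A self-contained sketch (all names hypothetical, not the CMSSW code) that inverts one fixed 2x2 matrix and applies it to many right-hand sides:

    #include <emmintrin.h>  // SSE2 double-precision intrinsics
    #include <cstddef>

    // Hypothetical sketch: solve m * x = c for many right-hand sides with one
    // fixed 2x2 matrix m, paying for every _mm_set* only once, outside the loop.
    void solveMany(double m00, double m01, double m10, double m11,
                   const double* rhs, double* out, std::size_t n) {
      const __m128d invdet = _mm_set1_pd(1. / (m00 * m11 - m01 * m10));
      const __m128d col0 = _mm_set_pd(-m10, m11);  // first column of adj(m)
      const __m128d col1 = _mm_set_pd(m00, -m01);  // second column of adj(m)
      for (std::size_t i = 0; i < n; ++i) {
        const __m128d c0 = _mm_set1_pd(rhs[2 * i]);      // broadcast c.x
        const __m128d c1 = _mm_set1_pd(rhs[2 * i + 1]);  // broadcast c.y
        // x = adj(m) * c * (1/det), all lane-wise: no constant setup per iteration
        const __m128d x = _mm_mul_pd(
            _mm_add_pd(_mm_mul_pd(col0, c0), _mm_mul_pd(col1, c1)), invdet);
        _mm_storeu_pd(&out[2 * i], x);
      }
    }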
With modified structures
MagneticField/Interpolation, LinearGridInterpolator3D::interpolate
For the unit test, using PfmCodeAnalyser (UNHALTED_CORE_CYCLES):
- Without SSE intrinsics: 2817340
- With SSE intrinsics: 1375656
- Speedup: 2.05
Reasons
- A lot of operations are made with the same data
- Modified data structures allow the use of movaps instructions, which significantly increases the speedup
- Because of the new data structures, a lot of adaptation work would be needed
- Alternatively, the current data structures can be modified to use more memory, making execution slower for unvectorized code but fast for vectorized code
Code
Replace
    typedef Basic3DVector<float> ValueType;
with
    typedef ArrVec ValueType;

    union ArrVec {
      __m128 vec;
      float __attribute__ ((aligned(16))) arr[4];
      ArrVec() {}
      ArrVec(float f1, float f2, float f3) {
        arr[0] = f1; arr[1] = f2; arr[2] = f3; arr[3] = 0.f;
      }
    };
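A hypothetical usage sketch for ArrVec: because the union is 16-byte aligned and holds a __m128 directly, the compiler can move the whole vector with one aligned load (the movaps shown on the next slide) while scalar access through arr stays available. Reading arr after writing vec relies on union type punning, which GCC accepts.

    #include <xmmintrin.h>

    // Hypothetical example, not CMSSW code: one aligned load, one SSE multiply.
    float scaledY(const ArrVec& v, float factor) {
      ArrVec r;
      r.vec = _mm_mul_ps(v.vec, _mm_set1_ps(factor));  // v.vec loads with movaps
      return r.arr[1];                                 // scalar lanes still accessible
    }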
Difference between setting __m128 x
Unmodified:
    movl -868(%rbp), %eax
    movl %eax, -56(%rbp)
    movl -872(%rbp), %eax
    movl %eax, -60(%rbp)
    movl -876(%rbp), %eax
    movl %eax, -64(%rbp)
    movl -880(%rbp), %eax
    movl %eax, -68(%rbp)
    movss -68(%rbp), %xmm1
    movss -64(%rbp), %xmm0
    movaps %xmm1, %xmm2
    unpcklps %xmm0, %xmm2
    movss -60(%rbp), %xmm1
    movss -56(%rbp), %xmm0
    movaps %xmm1, %xmm3
    unpcklps %xmm0, %xmm3
    movaps %xmm3, %xmm0
    movaps %xmm2, %xmm1
    movlhps %xmm0, %xmm1
    movaps %xmm1, %xmm0
    movaps %xmm0, -960(%rbp)
Modified:
    movaps -864(%rbp), %xmm0
    movaps %xmm0, -944(%rbp)
Speed comparison
Custom SIMD library
RecoLocalTracker/SiStripRecHitConverter, SiStripRecHitMatcher::match
For the CMSSW test, using CMSSW performance monitoring tools (UNHALTED_CORE_CYCLES, sampling period 1000):
- Without SSE intrinsics, match + project: 263153
- With SSE intrinsics, match: 204539
- Speedup: 1.3
Timestamp counter results
Measured only for the computational part of project (inside the loop), because of the noise when measuring larger parts of the code.
Sum of 1000 runs:
- project, old: 203960
- project with intrinsics: 102576
- project with the custom library: 110656
Speedup with intrinsics: 1.99
Speedup with the custom library: 1.84
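A minimal sketch of how such timestamp-counter numbers can be taken: __rdtsc is the GCC/Clang intrinsic, and projectKernel stands in for the measured code.

    #include <x86intrin.h>
    #include <cstdint>

    void projectKernel();  // hypothetical: the computational part under test

    // Sum the cycle counts of 1000 runs of the measured region only,
    // keeping the noisy surrounding code out of the measurement.
    uint64_t measure() {
      uint64_t total = 0;
      for (int run = 0; run < 1000; ++run) {
        const uint64_t t0 = __rdtsc();
        projectKernel();
        total += __rdtsc() - t0;
      }
      return total;
    }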
Speedup comparison
Code
    class VectorSimd {
    private:
      __m128 _simd;
      inline VectorSimd(__m128 simd);
    public:
      inline VectorSimd();
      inline VectorSimd(float x1, float x2, float x3, float x4);
      inline VectorSimd(float x);
      inline VectorSimd(const LocalVector& vec);
      inline const VectorSimd operator+(const VectorSimd& x) const;
      inline const VectorSimd operator*(const VectorSimd& x) const;
      inline const VectorSimd operator-(const VectorSimd& x) const;
      inline const VectorSimd operator/(const VectorSimd& x) const;
      inline float get(const int& n) const;
      inline VectorSimd getSimd(const int& n) const;
      inline void set(float x);
      inline void set(float x1, float x2, float x3, float x4);
    };
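Two of the declared methods could be implemented as follows; a sketch under the assumption that _simd holds the four lanes and that the private __m128 constructor simply stores its argument.

    // Note: _mm_set_ps takes its arguments from the highest lane down,
    // hence the reversed order.
    inline VectorSimd::VectorSimd(float x1, float x2, float x3, float x4)
        : _simd(_mm_set_ps(x4, x3, x2, x1)) {}

    inline const VectorSimd VectorSimd::operator+(const VectorSimd& x) const {
      return VectorSimd(_mm_add_ps(_simd, x._simd));  // one SSE add once inlined
    }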
Code
    class GeomDetSimd {
    protected:
      VectorSimd _rotation[3];
      VectorSimd _position;
    public:
      inline GeomDetSimd(const GeomDet* geomDet);
      inline VectorSimd rotate(const LocalPoint& lp) const;
      inline VectorSimd rotate(const VectorSimd& lp) const;
      inline VectorSimd shift(const VectorSimd& lp) const;
      inline VectorSimd negShift(const VectorSimd& lp) const;
    };
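rotate can then reuse the broadcast-and-accumulate pattern from project; a sketch assuming _rotation[i] holds the vector that multiplies the i-th local coordinate and getSimd(n) broadcasts lane n into all four lanes.

    inline VectorSimd GeomDetSimd::rotate(const VectorSimd& lp) const {
      // ret = _rotation[0]*lp.x + _rotation[1]*lp.y + _rotation[2]*lp.z
      return _rotation[0] * lp.getSimd(0)
           + _rotation[1] * lp.getSimd(1)
           + _rotation[2] * lp.getSimd(2);
    }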
Double loop optimization (with the custom library)
RecoTracker/MeasurementDet, TkGluedMeasurementDet::collectRecHits
For the CMSSW test, using CMSSW performance monitoring tools (UNHALTED_CORE_CYCLES, sampling period 1000):
- Without SSE intrinsics, collectRecHits: 860332 + ε
- With SSE intrinsics, collectRecHits: 594419
- Speedup: 1.45
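Schematically, the gain comes from doing all SIMD setup at the outermost possible level; the shape below is illustrative only (loop variables and container names are hypothetical, not the actual collectRecHits code).

    GeomDetSimd det(geomDet);  // all _mm_set* construction cost paid once
    for (auto const& mono : monoHits) {
      // per-outer-iteration setup, done once per mono hit
      const VectorSimd monoGlobal =
          det.shift(det.rotate(VectorSimd(mono.localPosition())));
      for (auto const& stereo : stereoHits) {
        // inner body runs pure SIMD arithmetic on monoGlobal; no setup left here
        // ... matching logic ...
      }
    }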
Speedup comparison
Suggestions using SIMD
A trivial suggestion, though not always followed because of maintainability concerns, is to hoist operations out of as many loops as possible; when a loop runs many times, this alone has a moderate impact. For the maximum speedup in the long run, the best way to introduce SIMD instructions is to modify the data structures already used in CMSSW: the tests have shown this to have the largest effect on performance.
Suggestions using SIMD
If the structures are not modified but SIMD instructions might still be useful, simple libraries of basic SIMD vectors and objects (like matrices) can be used as an alternative where needed. In any case, all initializations should be done outside the loops whenever possible, because they have a significant impact on performance, especially when libraries are used instead of rewritten structures.
Suggestions using SIMD
After each change to SIMD in the code, especially if there are no loops and the custom libraries are used, performance should be measured: in many cases it will go down (or stay the same), because a small number of SIMD instructions cannot amortize the long initialization time of the SIMD structures.
Suggestions using SIMD
Rethink the algorithms and operation sequences. Sometimes the algorithm used for SISD is not the best one for SIMD, because it does not exploit the ability to apply one operator to multiple data, or the fact that the data can be represented as vectors. The simplest example is matrix-vector multiplication, sketched below.
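For the matrix-vector example: the SISD-natural formulation multiplies each row by the vector and needs a horizontal sum per row, while the column-wise formulation below maps directly onto SIMD; a generic SSE sketch.

    #include <xmmintrin.h>

    // 4x4 matrix (stored as four columns) times a 4-float vector: broadcast
    // each vector component and accumulate whole columns; no horizontal adds.
    __m128 matVec(const __m128 col[4], const float v[4]) {
      __m128 r = _mm_mul_ps(col[0], _mm_set1_ps(v[0]));
      r = _mm_add_ps(r, _mm_mul_ps(col[1], _mm_set1_ps(v[1])));
      r = _mm_add_ps(r, _mm_mul_ps(col[2], _mm_set1_ps(v[2])));
      r = _mm_add_ps(r, _mm_mul_ps(col[3], _mm_set1_ps(v[3])));
      return r;
    }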
Conclusions
- Using SSE intrinsics mostly binds the code to single precision (or, in some cases, double precision) floating point data, so the templates now used in CMSSW lose their purpose
- Modified data structures can be a solution for decreasing load/set times
- Inlined functions can be used to structure the code and make it more manageable
Future work
- The custom library can be expanded with more structures that are used in CMSSW and with more methods for the matrix and vector classes already implemented, starting from things as simple as a dot product for vectors or matrix multiplication (see the sketch after this slide)
- The structures in CMSSW that can be represented as vectors or arrays of vectors, and that are used in many computations after a single initialization, should be rewritten using SIMD intrinsics
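As an example of the first point, a dot product needs only baseline SSE; a sketch (SSE3's _mm_hadd_ps or SSE4.1's _mm_dp_ps would shorten it).

    #include <xmmintrin.h>

    // Dot product of two 4-float SSE vectors with SSE1 shuffles only.
    inline float dot(__m128 a, __m128 b) {
      __m128 p = _mm_mul_ps(a, b);                    // lane-wise products
      __m128 s = _mm_add_ps(p, _mm_movehl_ps(p, p));  // lanes (0+2, 1+3, ., .)
      s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));  // add lane 1 into lane 0
      return _mm_cvtss_f32(s);                        // extract the scalar result
    }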