Optimizing the code using SSE intrinsics

Presentation on theme: "Optimizing the code using SSE intrinsics"— Presentation transcript:

1 Optimizing the code using SSE intrinsics
SIMD experiments

2 Summary
Software speedup when using SIMD operations
Problems with using SIMD operations (long loading/setting times, fixed data types)
Possible solutions (new data structures)
Comparison of speedup when using SIMD operations in different contexts
Speedup using a custom library for code manageability
Speedup when optimizing a double loop
Suggestions on using SIMD instructions

3 Performance increase in vectorized code
RecoLocalTracker/SiStripRecHitConverter, SiStripRecHitMatcher::project
For the unit test, using PfmCodeAnalyser (UNHALTED_CORE_CYCLES)
Without SSE intrinsics:
With SSE intrinsics:
Speedup: 1.72

4 Reasons
Vectorized arithmetic operations increase the performance
Loading/setting of data into the xmm registers takes a long time, because the data isn't adapted for use with SIMD instructions

5 Code
GlobalPoint globalpointini = (det->surface()).toGlobal(strip.first);
GlobalPoint globalpointend = (det->surface()).toGlobal(strip.second);

// toGlobal
__m128 ret1 = _mm_set1_ps(locp1.y());
ret1 = _mm_mul_ps(m1, ret1);
__m128 t2 = _mm_set1_ps(locp1.z());
t2 = _mm_mul_ps(m2, t2);
ret1 = _mm_add_ps(ret1, t2);
t2 = _mm_set1_ps(locp1.x());
t2 = _mm_mul_ps(m0, t2);
ret1 = _mm_add_ps(t2, ret1);
ret1 = _mm_add_ps(ret1, p0vec);

__m128 ret2 = _mm_set1_ps(locp2.y());
ret2 = _mm_mul_ps(m1, ret2);
t2 = _mm_set1_ps(locp2.z());
t2 = _mm_mul_ps(m2, t2);
ret2 = _mm_add_ps(ret2, t2);
t2 = _mm_set1_ps(locp2.x());
t2 = _mm_mul_ps(m0, t2);
ret2 = _mm_add_ps(t2, ret2);
ret2 = _mm_add_ps(ret2, p0vec);

6 In the context of CMSSW
RecoLocalTracker/SiStripRecHitConverter, SiStripRecHitMatcher::match
For the CMSSW test, using CMSSW performance monitoring tools (UNHALTED_CORE_CYCLES, sampling period 1000)
Without SSE intrinsics (match + project):
With SSE intrinsics (match):
Speedup: 1.3

7 Reasons
Virtual methods are used, so without analyzing the code from a larger perspective, not much can be done
Some of the code uses doubles, and with only a few operations to vectorize, we gain no speedup because of the load time
Some operations are not vectorizable

8 Code
// Before:
m.Invert();
AlgebraicVector2 solution = m * c;

// After:
__m128d mult = _mm_set1_pd(1./(m00*m11 - m01*m10));
__m128d resultmatmul = _mm_mul_pd(_mm_add_pd(_mm_mul_pd(minv10, c0vec),
    _mm_mul_pd(minv00, _mm_set1_pd(m11*((float *)&ret1)[1] + m10*((float *)&ret1)[0]))), mult);

// set all the constant __m128 and __m128d type variables
for (...) {
    // SIMD operations
}

9 With modified structures
MagneticField/Interpolation, LinearGridInterpolator3D::interpolate
For the unit test, using PfmCodeAnalyser (UNHALTED_CORE_CYCLES)
Without SSE intrinsics:
With SSE intrinsics:
Speedup: 2.05

10 Reasons
A lot of operations are performed on the same data
The modified data structures allow the use of movaps (aligned load) instructions, which significantly increases the speedup
Because of the new data structures, a lot of adaptation would be needed
Alternatively, the current data structures can be modified to use more memory: execution gets slower in unvectorized code, but fast in vectorized code

11 Code
// Before:
typedef Basic3DVector<float> ValueType;

// After:
typedef ArrVec ValueType;

union ArrVec {
    __m128 vec;
    float __attribute__ ((aligned(16))) arr[4];
    ArrVec() {}
    ArrVec(float &f1, float &f2, float &f3) {
        arr[0] = f1;
        arr[1] = f2;
        arr[2] = f3;
    }
};

12 Difference between setting __m128 x
Unmodified:
movl -868(%rbp), %eax
movl %eax, -56(%rbp)
movl -872(%rbp), %eax
movl %eax, -60(%rbp)
movl -876(%rbp), %eax
movl %eax, -64(%rbp)
movl -880(%rbp), %eax
movl %eax, -68(%rbp)
movss -68(%rbp), %xmm1
movss -64(%rbp), %xmm0
movaps %xmm1, %xmm2
unpcklps %xmm0, %xmm2
movss -60(%rbp), %xmm1
movss -56(%rbp), %xmm0
movaps %xmm1, %xmm3
unpcklps %xmm0, %xmm3
movaps %xmm3, %xmm0
movaps %xmm2, %xmm1
movlhps %xmm0, %xmm1
movaps %xmm1, %xmm0
movaps %xmm0, -960(%rbp)

Modified:
movaps -864(%rbp), %xmm0
movaps %xmm0, -944(%rbp)

13 Speed comparison

14 Custom SIMD library
RecoLocalTracker/SiStripRecHitConverter, SiStripRecHitMatcher::match
For the CMSSW test, using CMSSW performance monitoring tools (UNHALTED_CORE_CYCLES, sampling period 1000)
Without SSE intrinsics (match + project):
With SSE intrinsics (match):
Speedup: 1.3

15 Timestamp counter results
Only for the computational part of project (inside the loop), because of the noise when measuring larger parts of code
Sum of 1000 runs:
project old:
project with intrinsics:
project with a custom library:
Speedup with intrinsics: 1.99
Speedup with a custom library: 1.84

16 Speedup comparison

17 Code
class VectorSimd {
private:
    __m128 _simd;
    inline VectorSimd(__m128 simd);
public:
    inline VectorSimd();
    inline VectorSimd(float x1, float x2, float x3, float x4);
    inline VectorSimd(float x);
    inline VectorSimd(const LocalVector& vec);
    inline const VectorSimd operator+(const VectorSimd& x) const;
    inline const VectorSimd operator*(const VectorSimd& x) const;
    inline const VectorSimd operator-(const VectorSimd& x) const;
    inline const VectorSimd operator/(const VectorSimd& x) const;
    inline float get(const int& n) const;
    inline VectorSimd getSimd(const int& n) const;
    inline void set(float x);
    inline void set(float x1, float x2, float x3, float x4);
};

18 Code
class GeomDetSimd {
protected:
    VectorSimd _rotation[3];
    VectorSimd _position;
public:
    inline GeomDetSimd(const GeomDet* geomDet);
    inline VectorSimd rotate(const LocalPoint& lp) const;
    inline VectorSimd rotate(const VectorSimd& lp) const;
    inline VectorSimd shift(const VectorSimd& lp) const;
    inline VectorSimd negShift(const VectorSimd& lp) const;
};

19 Double loop optimization (with the custom library)
RecoTracker/MeasurementDet, TkGluedMeasurementDet::collectRecHits
For the CMSSW test, using CMSSW performance monitoring tools (UNHALTED_CORE_CYCLES, sampling period 1000)
Without SSE intrinsics (collectRecHits):
With SSE intrinsics (collectRecHits):
Speedup: 1.45

20 Speedup comparison

21 Suggestions on using SIMD
A trivial suggestion, though not always followed because of maintainability concerns, is to move SIMD set/load operations outside of as many loops as possible. When a loop runs many times, this has a moderate impact.
For the maximum speedup in the long run, the best way to introduce SIMD instructions is to modify the data structures already used in CMSSW. The tests have shown this to have the largest effect on performance.

22 Suggestions on using SIMD
If the structures are not modified but SIMD instructions could still be useful, simple libraries of basic SIMD vectors and objects (like matrices) can be used as an alternative where needed.
In any case, all of the initializations should be done outside the loops where possible, because they have a significant impact on performance, especially when the libraries are used rather than rewritten structures.

23 Suggestions on using SIMD
After each SIMD change to the code, especially if there are no loops and the custom libraries are used, the performance should be tested: there are many cases where performance goes down (or stays the same) because only a few SIMD instructions are executed while the SIMD structures take a long time to initialize.

24 Suggestions on using SIMD
Rethink the algorithms and operation sequences. Sometimes the algorithm used for SISD is not the best one for SIMD, because it does not exploit the ability to apply one operator to multiple data at once, or the fact that the data can be represented as vectors. (Simplest example: matrix-vector multiplication.)

25 Conclusions
Using SSE intrinsics mostly ties the code to a single-precision (or, in some cases, double-precision) floating point data type, so the templates now used in CMSSW lose their purpose
To decrease load/set times, modified data structures can be a solution
Inlined functions can be used to structure the code and make it more manageable

26 Future work
The custom library can be expanded with more of the structures used in CMSSW, and with more methods for the matrix and vector classes that are currently implemented (starting with things as simple as a dot product for vectors, or matrix multiplication)
The structures in CMSSW that can be represented as vectors or arrays of vectors, and that are used in many computations after a single initialization, should be rewritten using SIMD intrinsics

