SIMD experiments
Optimizing the code using SSE intrinsics
Summary
- Software speedup when using SIMD operations
- Problems with using SIMD operations (long loading/setting times, fixed data types)
- Possible solutions (new data structures)
- Comparison of speedup when using SIMD operations in different contexts
- Speedup using a custom library for code manageability
- Speedup when optimizing a double loop
- Suggestions on using SIMD instructions
Performance increase in vectorized code
RecoLocalTracker/SiStripRecHitConverter, SiStripRecHitMatcher::project
For the unit test, using PfmCodeAnalyser (UNHALTED_CORE_CYCLES):
- Without SSE intrinsics: 28297893
- With SSE intrinsics: 16407853
- Speedup: 1.72
Reasons
- Vectorized arithmetic operations increase performance
- Loading/setting data into the xmm registers takes a long time, because the data isn't laid out for use with SIMD instructions
Code
    GlobalPoint globalpointini = (det->surface()).toGlobal(strip.first);
    GlobalPoint globalpointend = (det->surface()).toGlobal(strip.second);

    // toGlobal, vectorized: ret = m0*x + m1*y + m2*z + p0
    __m128 ret1 = _mm_set1_ps(locp1.y());
    ret1 = _mm_mul_ps(m1, ret1);
    __m128 t2 = _mm_set1_ps(locp1.z());
    t2 = _mm_mul_ps(m2, t2);
    ret1 = _mm_add_ps(ret1, t2);
    t2 = _mm_set1_ps(locp1.x());
    t2 = _mm_mul_ps(m0, t2);
    ret1 = _mm_add_ps(t2, ret1);
    ret1 = _mm_add_ps(ret1, p0vec);

    __m128 ret2 = _mm_set1_ps(locp2.y());
    ret2 = _mm_mul_ps(m1, ret2);
    t2 = _mm_set1_ps(locp2.z());
    t2 = _mm_mul_ps(m2, t2);
    ret2 = _mm_add_ps(ret2, t2);
    t2 = _mm_set1_ps(locp2.x());
    t2 = _mm_mul_ps(m0, t2);
    ret2 = _mm_add_ps(t2, ret2);
    ret2 = _mm_add_ps(ret2, p0vec);
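The snippet above assumes m0, m1, m2 and p0vec have already been broadcast from the detector geometry. A minimal sketch of how they could be prepared once, outside any per-hit code: the rotation()/position() accessors follow the CMSSW Surface interface, but whether the rows or the columns of the rotation matrix go into m0..m2 depends on the convention toGlobal uses, so treat this as illustrative.

    // Illustrative setup (not the original code): broadcast the rotation
    // rows and the detector position into SSE registers once.
    // _mm_set_ps takes arguments from the highest lane down; lane 3 is unused.
    const Surface& s = det->surface();
    const __m128 m0 = _mm_set_ps(0.f, s.rotation().xz(), s.rotation().xy(), s.rotation().xx());
    const __m128 m1 = _mm_set_ps(0.f, s.rotation().yz(), s.rotation().yy(), s.rotation().yx());
    const __m128 m2 = _mm_set_ps(0.f, s.rotation().zz(), s.rotation().zy(), s.rotation().zx());
    const __m128 p0vec = _mm_set_ps(0.f, s.position().z(), s.position().y(), s.position().x());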
In the context of CMSSW
RecoLocalTracker/SiStripRecHitConverter, SiStripRecHitMatcher::match
For the CMSSW test, using CMSSW performance monitoring tools (UNHALTED_CORE_CYCLES, sampling period 1000):
- Without SSE intrinsics, match + project: 263153
- With SSE intrinsics, match: 204539
- Speedup: 1.3
Reasons
- Virtual methods are used, so without analyzing the code from a larger perspective, not much can be done
- Some of the code uses doubles, and with only a few operations to vectorize, the load time cancels out any speedup
- Some operations are not vectorizable
Code
    m.Invert();
    AlgebraicVector2 solution = m * c;

    __m128d mult = _mm_set1_pd(1. / (m00 * m11 - m01 * m10));
    __m128d resultmatmul = _mm_mul_pd(
        _mm_add_pd(_mm_mul_pd(minv10, c0vec),
                   _mm_mul_pd(minv00,
                              _mm_set1_pd(m11 * ((float *)&ret1)[1] + m10 * ((float *)&ret1)[0]))),
        mult);

    // set all the constant __m128 and __m128d type variables
    for (...) {
      // SIMD operations
    }
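The comment about setting the constant variables is the key point: every _mm_set* of a loop-invariant value should be hoisted out of the loop. A self-contained sketch (all names hypothetical, not the CMSSW code) that inverts one fixed 2x2 matrix and applies it to many right-hand sides:

    #include <emmintrin.h>  // SSE2 double-precision intrinsics
    #include <cstddef>

    // Hypothetical sketch: solve m * x = c for many right-hand sides with one
    // fixed 2x2 matrix m, paying for every _mm_set* only once, outside the loop.
    void solveMany(double m00, double m01, double m10, double m11,
                   const double* rhs, double* out, std::size_t n) {
      const __m128d invdet = _mm_set1_pd(1. / (m00 * m11 - m01 * m10));
      const __m128d col0 = _mm_set_pd(-m10, m11);  // first column of adj(m)
      const __m128d col1 = _mm_set_pd(m00, -m01);  // second column of adj(m)
      for (std::size_t i = 0; i < n; ++i) {
        const __m128d c0 = _mm_set1_pd(rhs[2 * i]);      // broadcast c.x
        const __m128d c1 = _mm_set1_pd(rhs[2 * i + 1]);  // broadcast c.y
        // x = adj(m) * c * (1/det), all lane-wise: no constant setup per iteration
        const __m128d x = _mm_mul_pd(
            _mm_add_pd(_mm_mul_pd(col0, c0), _mm_mul_pd(col1, c1)), invdet);
        _mm_storeu_pd(&out[2 * i], x);
      }
    }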
With modified structures
MagneticField/Interpolation, LinearGridInterpolator3D::interpolate
For the unit test, using PfmCodeAnalyser (UNHALTED_CORE_CYCLES):
- Without SSE intrinsics: 2817340
- With SSE intrinsics: 1375656
- Speedup: 2.05
Reasons
- A lot of operations are made with the same data
- Modified data structures allow the use of movaps instructions, which significantly increases the speedup
- Because of the new data structures, a lot of adaptation work would be needed
- Alternatively, the current data structures can be modified to use more memory, making execution slower for unvectorized code but fast for vectorized code
Code
Replace
    typedef Basic3DVector<float> ValueType;
with
    typedef ArrVec ValueType;

    union ArrVec {
      __m128 vec;
      float __attribute__ ((aligned(16))) arr[4];
      ArrVec() {}
      ArrVec(float f1, float f2, float f3) {
        arr[0] = f1; arr[1] = f2; arr[2] = f3; arr[3] = 0.f;
      }
    };
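A hypothetical usage sketch for ArrVec: because the union is 16-byte aligned and holds a __m128 directly, the compiler can move the whole vector with one aligned load (the movaps shown on the next slide) while scalar access through arr stays available. Reading arr after writing vec relies on union type punning, which GCC accepts.

    #include <xmmintrin.h>

    // Hypothetical example, not CMSSW code: one aligned load, one SSE multiply.
    float scaledY(const ArrVec& v, float factor) {
      ArrVec r;
      r.vec = _mm_mul_ps(v.vec, _mm_set1_ps(factor));  // v.vec loads with movaps
      return r.arr[1];                                 // scalar lanes still accessible
    }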
Difference between setting __m128 x
Unmodified:
    movl -868(%rbp), %eax
    movl %eax, -56(%rbp)
    movl -872(%rbp), %eax
    movl %eax, -60(%rbp)
    movl -876(%rbp), %eax
    movl %eax, -64(%rbp)
    movl -880(%rbp), %eax
    movl %eax, -68(%rbp)
    movss -68(%rbp), %xmm1
    movss -64(%rbp), %xmm0
    movaps %xmm1, %xmm2
    unpcklps %xmm0, %xmm2
    movss -60(%rbp), %xmm1
    movss -56(%rbp), %xmm0
    movaps %xmm1, %xmm3
    unpcklps %xmm0, %xmm3
    movaps %xmm3, %xmm0
    movaps %xmm2, %xmm1
    movlhps %xmm0, %xmm1
    movaps %xmm1, %xmm0
    movaps %xmm0, -960(%rbp)
Modified:
    movaps -864(%rbp), %xmm0
    movaps %xmm0, -944(%rbp)
Speed comparison
Custom SIMD library
RecoLocalTracker/SiStripRecHitConverter, SiStripRecHitMatcher::match
For the CMSSW test, using CMSSW performance monitoring tools (UNHALTED_CORE_CYCLES, sampling period 1000):
- Without SSE intrinsics, match + project: 263153
- With SSE intrinsics, match: 204539
- Speedup: 1.3
Timestamp counter results
Measured only for the computational part of project (inside the loop), because of the noise when measuring larger parts of the code.
Sum of 1000 runs:
- project, old: 203960
- project with intrinsics: 102576
- project with the custom library: 110656
Speedup with intrinsics: 1.99
Speedup with the custom library: 1.84
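A minimal sketch of how such timestamp-counter numbers can be taken: __rdtsc is the GCC/Clang intrinsic, and projectKernel stands in for the measured code.

    #include <x86intrin.h>
    #include <cstdint>

    void projectKernel();  // hypothetical: the computational part under test

    // Sum the cycle counts of 1000 runs of the measured region only,
    // keeping the noisy surrounding code out of the measurement.
    uint64_t measure() {
      uint64_t total = 0;
      for (int run = 0; run < 1000; ++run) {
        const uint64_t t0 = __rdtsc();
        projectKernel();
        total += __rdtsc() - t0;
      }
      return total;
    }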
Speedup comparison
Code
    class VectorSimd {
    private:
      __m128 _simd;
      inline VectorSimd(__m128 simd);
    public:
      inline VectorSimd();
      inline VectorSimd(float x1, float x2, float x3, float x4);
      inline VectorSimd(float x);
      inline VectorSimd(const LocalVector& vec);
      inline const VectorSimd operator+(const VectorSimd& x) const;
      inline const VectorSimd operator*(const VectorSimd& x) const;
      inline const VectorSimd operator-(const VectorSimd& x) const;
      inline const VectorSimd operator/(const VectorSimd& x) const;
      inline float get(const int& n) const;
      inline VectorSimd getSimd(const int& n) const;
      inline void set(float x);
      inline void set(float x1, float x2, float x3, float x4);
    };
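Two of the declared methods could be implemented as follows; a sketch under the assumption that _simd holds the four lanes and that the private __m128 constructor simply stores its argument.

    // Note: _mm_set_ps takes its arguments from the highest lane down,
    // hence the reversed order.
    inline VectorSimd::VectorSimd(float x1, float x2, float x3, float x4)
        : _simd(_mm_set_ps(x4, x3, x2, x1)) {}

    inline const VectorSimd VectorSimd::operator+(const VectorSimd& x) const {
      return VectorSimd(_mm_add_ps(_simd, x._simd));  // one SSE add once inlined
    }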
Code
    class GeomDetSimd {
    protected:
      VectorSimd _rotation[3];
      VectorSimd _position;
    public:
      inline GeomDetSimd(const GeomDet* geomDet);
      inline VectorSimd rotate(const LocalPoint& lp) const;
      inline VectorSimd rotate(const VectorSimd& lp) const;
      inline VectorSimd shift(const VectorSimd& lp) const;
      inline VectorSimd negShift(const VectorSimd& lp) const;
    };
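rotate can then reuse the broadcast-and-accumulate pattern from project; a sketch assuming _rotation[i] holds the vector that multiplies the i-th local coordinate and getSimd(n) broadcasts lane n into all four lanes.

    inline VectorSimd GeomDetSimd::rotate(const VectorSimd& lp) const {
      // ret = _rotation[0]*lp.x + _rotation[1]*lp.y + _rotation[2]*lp.z
      return _rotation[0] * lp.getSimd(0)
           + _rotation[1] * lp.getSimd(1)
           + _rotation[2] * lp.getSimd(2);
    }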
Double loop optimization (with the custom library)
RecoTracker/MeasurementDet, TkGluedMeasurementDet::collectRecHits
For the CMSSW test, using CMSSW performance monitoring tools (UNHALTED_CORE_CYCLES, sampling period 1000):
- Without SSE intrinsics, collectRecHits: 860332 + ε
- With SSE intrinsics, collectRecHits: 594419
- Speedup: 1.45
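Schematically, the gain comes from doing all SIMD setup at the outermost possible level; the shape below is illustrative only (loop variables and container names are hypothetical, not the actual collectRecHits code).

    GeomDetSimd det(geomDet);  // all _mm_set* construction cost paid once
    for (auto const& mono : monoHits) {
      // per-outer-iteration setup, done once per mono hit
      const VectorSimd monoGlobal =
          det.shift(det.rotate(VectorSimd(mono.localPosition())));
      for (auto const& stereo : stereoHits) {
        // inner body runs pure SIMD arithmetic on monoGlobal; no setup left here
        // ... matching logic ...
      }
    }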
Speedup comparison
Suggestions using SIMD
A trivial suggestion, though not always followed because of maintainability concerns, is to hoist operations out of as many loops as possible; when a loop runs many times, this alone has a moderate impact. For the maximum speedup in the long run, the best way to introduce SIMD instructions is to modify the data structures already used in CMSSW: the tests have shown this to have the largest effect on performance.
Suggestions using SIMD
If the structures are not modified but SIMD instructions might still be useful, simple libraries of basic SIMD vectors and objects (like matrices) can be used as an alternative where needed. In any case, all initializations should be done outside the loops whenever possible, because they have a significant impact on performance, especially when libraries are used instead of rewritten structures.
Suggestions using SIMD
After each change to SIMD in the code, especially if there are no loops and the custom libraries are used, performance should be measured: in many cases it will go down (or stay the same), because a small number of SIMD instructions cannot amortize the long initialization time of the SIMD structures.
Suggestions using SIMD
Rethink the algorithms and operation sequences. Sometimes the algorithm used for SISD is not the best one for SIMD, because it does not exploit the ability to apply one operator to multiple data, or the fact that the data can be represented as vectors. The simplest example is matrix-vector multiplication, sketched below.
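For the matrix-vector example: the SISD-natural formulation multiplies each row by the vector and needs a horizontal sum per row, while the column-wise formulation below maps directly onto SIMD; a generic SSE sketch.

    #include <xmmintrin.h>

    // 4x4 matrix (stored as four columns) times a 4-float vector: broadcast
    // each vector component and accumulate whole columns; no horizontal adds.
    __m128 matVec(const __m128 col[4], const float v[4]) {
      __m128 r = _mm_mul_ps(col[0], _mm_set1_ps(v[0]));
      r = _mm_add_ps(r, _mm_mul_ps(col[1], _mm_set1_ps(v[1])));
      r = _mm_add_ps(r, _mm_mul_ps(col[2], _mm_set1_ps(v[2])));
      r = _mm_add_ps(r, _mm_mul_ps(col[3], _mm_set1_ps(v[3])));
      return r;
    }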
Conclusions
- Using SSE intrinsics mostly binds the code to single precision (or, in some cases, double precision) floating point data, so the templates now used in CMSSW lose their purpose
- Modified data structures can be a solution for decreasing load/set times
- Inlined functions can be used to structure the code and make it more manageable
Future work
- The custom library can be expanded with more structures that are used in CMSSW and with more methods for the matrix and vector classes already implemented, starting from things as simple as a dot product for vectors or matrix multiplication (see the sketch after this slide)
- The structures in CMSSW that can be represented as vectors or arrays of vectors, and that are used in many computations after a single initialization, should be rewritten using SIMD intrinsics
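As an example of the first point, a dot product needs only baseline SSE; a sketch (SSE3's _mm_hadd_ps or SSE4.1's _mm_dp_ps would shorten it).

    #include <xmmintrin.h>

    // Dot product of two 4-float SSE vectors with SSE1 shuffles only.
    inline float dot(__m128 a, __m128 b) {
      __m128 p = _mm_mul_ps(a, b);                    // lane-wise products
      __m128 s = _mm_add_ps(p, _mm_movehl_ps(p, p));  // lanes (0+2, 1+3, ., .)
      s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));  // add lane 1 into lane 0
      return _mm_cvtss_f32(s);                        // extract the scalar result
    }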