Scalar and Serial Optimization
Financial Services Engineering Software and Services Group Intel Corporation
Intel® Xeon Phi™ Coprocessor

Agenda
- Objective
- Algorithmic and Language
- Precision, Accuracy, Function Domain
- Lab Step 2: Scalar, Serial Optimization
- Summary

iXPTC 2013 Intel® Xeon Phi™ Coprocessor
Objective
Objective of Scalar and Serial Optimization
- Obtain the most efficient implementation for the problem at hand
- Identify the opportunities for vectorization and parallelization
- Create a baseline against which vectorization and parallelization gains are measured
- Avoid the situation where slow scalar code is vectorized and then parallelized, creating a false impression of the performance gain
Algorithmic and Language
Algorithmic Optimizations
- Hoist constants out of the core loops
  - The compiler can do this, but it needs your cooperation
  - Group constants together
- Avoid and replace expensive operations
  - Division by a constant can usually be replaced by multiplication with its reciprocal
- Strength reduction in hot loops
  - The inductive form is popular because it is clean
  - The iterative form can strength-reduce the operations involved
  - In this example, exp() is replaced by a simple multiplication

Before strength reduction:

```c
const double dt  = T / (double)TIMESTEPS;
const double vDt = V * sqrt(dt);
for (int i = 0; i <= TIMESTEPS; i++) {
    double price = S * exp(vDt * (2.0 * i - TIMESTEPS));
    cell[i] = max(price - X, 0.0);
}
```

After strength reduction:

```c
const double factor = exp(vDt * 2.0);
double price = S * exp(-vDt * (2.0 + TIMESTEPS));
for (int i = 0; i <= TIMESTEPS; i++) {
    price = factor * price;
    cell[i] = max(price - X, 0.0);
}
```
Understand the C/C++ Type Conversion Rules
- double is higher in the type hierarchy than float; a float variable is promoted to double whenever it operates with a double
  - 0.5 * V * V triggers an implicit conversion if V is a float
- double is at least 2x slower than float, and the type conversion itself is very expensive: about 6 cycles inside the VPU
- Avoid untyped floating-point literals; always type your constants
  - Use const float HALF = 0.5f;
- Choose the right runtime functions
  - sqrt(), exp(), log() take double parameters
  - sqrtf(), expf(), logf() take float parameters
Use Mathematical Equivalence
- A direct implementation of a mathematical formula can result in redundant computation
- Understand your target machine
- Transform your calculation into basic operations
- Reuse previous results as much as you can
- Fewer add/multiply operations also make the result more accurate
- Example: the Black-Scholes formula
Precision, Accuracy and Domain
Understand the Floating-Point Arithmetic Units
- The Vector Processing Unit (VPU) executes vector FP instructions; the x87 unit also exists and can execute FP instructions as well
- The compiler chooses which unit to use for FP operations
  - The VPU is preferred because of its speed, and it also makes FP results reproducible
- x87 should be used for two reasons
  - To reproduce the same results as 15 years ago, right or wrong
  - To generate FP exceptions for debugging purposes
- The Intel Compiler defaults to the VPU; the user can override this with -fp-model strict
Choose the Right Precision for Your Problem
- Understand the precision requirement of your problem
  - For some algorithms single precision is good enough
  - Example 1: Newton-Raphson function approximation
  - Example 2: Monte Carlo, if the rounding error is controlled
- SP will always be faster, by at least 2x
- Mixed precision is also an option
  - Conversion between the two FP formats is not free

| Parameter | Single | Double | Extended (IEEE_X) |
|---|---|---|---|
| Format width in bits | 32 | 64 | 128 |
| Sign width in bits | 1 | 1 | 1 |
| Mantissa width in bits | 23 | 52 | 112 (113 implied) |
| Exponent width in bits | 8 | 11 | 15 |
| Max binary exponent | +127 | +1023 | +16383 |
| Min binary exponent | -126 | -1022 | -16382 |
| Exponent bias | 127 | 1023 | 16383 |
| Max value | ~3.4 x 10^38 | ~1.8 x 10^308 | ~1.2 x 10^4932 |
| Min normalized value | ~1.2 x 10^-38 | ~2.2 x 10^-308 | ~3.4 x 10^-4932 |
| Min denormalized value | ~1.4 x 10^-45 | ~4.9 x 10^-324 | ~6.5 x 10^-4966 |
Use the Right Accuracy Mode
- Accuracy affects the performance of your program; choose the accuracy your problem requires
- Mixed accuracies yield the same accuracy as the lowest one
- Choices for accuracy
  - Intel MKL accuracy modes HA, LA, EP via an API call: vmlSetMode(VML_EP);
  - Intel® Compiler switches: -fimf-precision=low|medium|high, -fimf-accuracy-bits=11
Understand the Domain of Your Problem
- The 80/20 rule in computer arithmetic: 20% of the time is spent getting good results for 80% of the inputs; 80% of the time is spent getting the corner cases right
- Every function call has to check for NaNs, denormals, etc.
- Excluding corner cases can result in higher performance
- The Intel Compiler supports domain exclusion
  - Use -fimf-domain-exclusion=<n1>, where <n1> is a bitwise OR of the masks below
  - 15: common exclusions; 31: avoid all corner cases

| Values to exclude | Mask |
|---|---|
| None | 0 |
| Extreme values | 1 |
| NaNs | 2 |
| Infinities | 4 |
| Denormals | 8 |
| Zeros | 16 |
Combining Compiler Switches
- Lowest-precision sequence for SP/DP: -fimf-precision=low -fimf-domain-exclusion=15
- Low precision for DP: -fimf-domain-exclusion=15 -fimf-accuracy-bits=22
- Low precision for SP, even lower for DP: -fimf-precision=low -fimf-domain-exclusion=11
- Lower accuracy than the default 4 ulps, but higher than the above: -fimf-max-error=2048 -fimf-domain-exclusion=15
- Adding the domain exclusion to the default model: -fp-model fast=2 -fimf-domain-exclusion=15
- Vectorized, high-precision division, square root, and transcendental functions from libsvml: -fp-model precise -no-prec-div -no-prec-sqrt -fast-transcendentals -fimf-precision=high
Lab Step 2 Scalar, Serial Optimization
Step 2 Scalar, Serial Optimization
- Inspect your source code for language-related inefficiencies
  - Type your constants
  - Be explicit about the C/C++ runtime API
- Experiment with your precision and accuracy settings: -fimf-precision=low -no-prec-div -no-prec-sqrt
- Experiment with domain exclusion: -fimf-domain-exclusion=15
Summary
Summary
- Optimize your algorithm first
- Avoid unexpected C/C++ type conversions
- Choose the right representation and accuracy level
- Experiment with -fp-model fast=2 with the Intel Compiler
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #
Compiler Support
FP switches for Intel Compiler and GCC
-fp-model for the Intel Compiler:
- fast[=1]: optimized for performance (default)
- fast=2: aggressive approximations
- precise: value-safe optimizations only
- source|double|extended: specify the method for expression evaluation; these imply "precise" unless explicitly overridden
- except: enable floating-point exception semantics
- strict: precise + except + disable FMA + don't assume the default floating-point environment

Floating-point controls in GCC:
- -f[no-]fast-math is the high-level option; it is off by default (different from the Intel Compiler)
- -Ofast turns on -ffast-math
- -funsafe-math-optimizations turns on re-association
- Separate controls govern reproducibility of exceptions and assumptions about the floating-point environment

On Windows, use /fp:. The -mp, -Op, and IPF-fltacc options are deprecated and will be removed in a future release.
Value Safety
In SAFE mode, the compiler's hands are tied: all of the following transformations are prohibited.

Optimizations at stake: re-association, flush-to-zero, expression evaluation, various mathematical simplifications, approximate divide and sqrt, math library approximations.

| Prohibited transformation | Why it is not value-safe |
|---|---|
| x / x ⇒ 1.0 | x could be 0.0, ∞, or NaN |
| x - y ⇒ -(y - x) | if x equals y, x - y is +0.0 while -(y - x) is -0.0 |
| x - x ⇒ 0.0 | x could be ∞ or NaN |
| x * 0.0 ⇒ 0.0 | x could be -0.0, ∞, or NaN |
| x + 0.0 ⇒ x | x could be -0.0 |
| (x + y) + z ⇒ x + (y + z) | general re-association is not value-safe |
| (x == x) ⇒ true | x could be NaN |

Most examples are from the C99 standard. fast=2 may include limited-range complex division and limited-range implementations of other math functions on Intel® MIC Architecture.
Floating-Point Behavior
- Floating-point exception flags are set by Intel IMCI
  - Unmasking and trapping are not supported; attempts to unmask will result in a segmentation fault
  - -fp-trap (C) is disabled
  - -fp-model except or strict will yield (slow!) x87 code that supports unmasking and trapping of floating-point exceptions
- Denormals are supported by Intel IMCI
  - This needs -no-ftz or -fp-model precise (as on the host)
- 512-bit vector transcendental math functions are available
  - Four elementary functions are available: RECIP, RSQRT, EXP2, LOG2
  - DIV and SQRT benefit from these four functions
  - SVML can even be inlined to avoid function-call overhead
- Many options select different implementations: -fimf-domain-exclusion=, -fimf-precision, -fimf-max-error, …
  - See "Differences in floating-point arithmetic between Intel® Xeon® processors and the Intel® Xeon Phi™ coprocessor" for details and status
- The IEEE_EXCEPTIONS, IEEE_FEATURES, and IEEE_ARITHMETIC modules are not yet functional (Fortran 13.0); an update is hoped for in update 2
Further Information
- Microsoft Visual C++* Floating-Point Optimization
- The Intel® C++ and Fortran Compiler Documentation, "Floating Point Operations"
- "Consistency of Floating-Point Results using the Intel® Compiler"
- "Differences in Floating-Point Arithmetic between Intel® Xeon® Processors and the Intel® Xeon Phi™ Coprocessor"
- Goldberg, David: "What Every Computer Scientist Should Know About Floating-Point Arithmetic", ACM Computing Surveys, March 1991, p. 203