Scalar and Serial Optimization


Scalar and Serial Optimization
Financial Services Engineering, Software and Services Group, Intel Corporation

Agenda
- Objective
- Algorithmic and Language
- Precision, Accuracy, Function Domain
- Lab Step 2: Scalar, Serial Optimization
- Summary
iXPTC 2013, Intel® Xeon Phi™ Coprocessor

Objective

Objective of Scalar and Serial Optimization
- Obtain the most efficient implementation for the problem at hand
- Identify the opportunities for vectorization and parallelization
- Create a baseline against which vectorization and parallelization gains can be measured
- Avoid the situation where slow scalar code is vectorized and parallelized, creating a false impression of performance gain

Algorithmic and Language

Algorithmic Optimizations
- Hoist constants out of the core loops
  - The compiler can do this, but it needs your cooperation: group constants together
- Avoid and replace expensive operations
  - Division by a constant can usually be replaced by multiplication with its reciprocal
- Strength reduction in hot loops
  - People like the inductive form because it is clean; the iterative form can strength-reduce the operations involved
  - In this example, exp() is replaced by a simple multiplication:

    // Inductive form: one exp() per iteration
    const double dt  = T / (double)TIMESTEPS;
    const double vDt = V * sqrt(dt);
    for (int i = 0; i <= TIMESTEPS; i++) {
        double price = S * exp(vDt * (2.0 * i - TIMESTEPS));
        cell[i] = max(price - X, 0);
    }

    // Iterative form: exp() strength-reduced to one multiplication per step
    const double factor = exp(vDt * 2.0);
    double price = S * exp(-vDt * (2.0 + TIMESTEPS));
    for (int i = 0; i <= TIMESTEPS; i++) {
        price = factor * price;
        cell[i] = max(price - X, 0);
    }

Understand the C/C++ Type Conversion Rules
- C/C++ implicit type conversion: double is higher in the type hierarchy than float
  - A variable is promoted to double when it operates with another double: 0.5*V*V triggers an implicit conversion if V is a float
  - double arithmetic is at least 2x slower than float
  - The type conversion itself is expensive: 6 cycles inside the VPU
- Avoid untyped floating-point literals; always type your constants
  - Use const float HALF = 0.5f;
- Choose the right runtime functions
  - sqrt(), exp(), log() take double parameters
  - sqrtf(), expf(), logf() take float parameters

Use Mathematical Equivalence
- Direct implementation of a mathematical formula can result in redundant computation
- Understand your target machine
- Transform your calculation into the basic operations
- Reuse previous results as much as you can
- Fewer add/multiply operations also make the result more accurate
- Example: the Black-Scholes formula

Precision, Accuracy and Domain

Understand the Floating-Point Arithmetic Units
- The Vector Processing Unit (VPU) executes vector FP instructions
- An x87 unit also exists and can execute FP instructions as well
- The compiler chooses where each FP operation executes
  - The VPU is preferred because of its speed, and it can also make FP results reproducible
- x87 should be used for two reasons only:
  - Reproducing the same results as 15 years ago, right or wrong
  - Generating FP exceptions for debugging purposes
- The Intel Compiler defaults to the VPU; the user can override this with -fp-model strict

Choose the Right Precision for Your Problem
- Understand the precision requirement of your problem
  - For some algorithms single precision is good enough
  - Example 1: Newton-Raphson function approximation
  - Example 2: Monte Carlo, if the rounding error is controlled
- SP will always be faster, by at least 2x
- Mixed precision is also an option
  - Conversions between two FP formats are not free

Parameter                 Single          Double           Extended (IEEE_X)
Format width in bits      32              64               128
Sign width in bits        1               1                1
Mantissa width in bits    23              52               112 (113 implied)
Exponent width in bits    8               11               15
Max binary exponent       +127            +1023            +16383
Min binary exponent       -126            -1022            -16382
Exponent bias             127             1023             16383
Max value                 ~3.4 x 10^38    ~1.8 x 10^308    ~1.2 x 10^4932
Min normalized value      ~1.2 x 10^-38   ~2.2 x 10^-308   ~3.4 x 10^-4932
Min denormalized value    ~1.4 x 10^-45   ~4.9 x 10^-324   ~6.5 x 10^-4966

Use the Right Accuracy Mode
- Accuracy affects the performance of your program; choose the accuracy your problem requires
- Mixed accuracies yield the accuracy of the lowest one
- Choices for accuracy:
  - Intel MKL accuracy modes HA, LA, EP: API call vmlSetMode(VML_EP);
  - Intel® Compiler switches: -fimf-precision=low|medium|high, -fimf-accuracy-bits=11

Understand the Domain of Your Problem
- The 80/20 rule of computer arithmetic: 20% of the time is spent getting good results for 80% of the inputs; 80% of the time is spent getting the corner cases right
  - Every function call has to check for NaNs, denormals, etc.
- Excluding corner cases can result in higher performance
- The Intel Compiler supports domain exclusion
  - Use -fimf-domain-exclusion=<n1>, where <n1> is the bitwise OR of the masks below
  - 15: common exclusions; 31: avoid all corner cases

Values to exclude    Mask
none                 0
Extreme values       1
NaNs                 2
Infinities           4
Denormals            8
Zeros                16

Combinations of Compiler Switches
- Lowest precision sequence for SP/DP:
  -fimf-precision=low -fimf-domain-exclusion=15
- Low precision for DP:
  -fimf-domain-exclusion=15 -fimf-accuracy-bits=22
- Low precision for SP, even lower for DP:
  -fimf-precision=low -fimf-domain-exclusion=11
- Lower accuracy than the default 4 ulps, higher than the above:
  -fimf-max-error=2048 -fimf-domain-exclusion=15
- Adding domain exclusion to the default:
  -fp-model fast=2 -fimf-domain-exclusion=15
- Vectorized, high-precision division, square root and transcendental functions from libsvml:
  -fp-model precise -no-prec-div -no-prec-sqrt -fast-transcendentals -fimf-precision=high

Lab Step 2: Scalar, Serial Optimization

Step 2: Scalar, Serial Optimization
- Inspect your source code for language-related inefficiencies
  - Type your constants
  - Be explicit about the C/C++ runtime API calls you use
- Experiment with your precision and accuracy settings:
  -fimf-precision=low -no-prec-div -no-prec-sqrt
- Experiment with domain exclusion:
  -fimf-domain-exclusion=15

Summary

Summary
- Optimize your algorithm first
- Avoid unexpected C/C++ type conversions
- Choose the right representation and accuracy level
- Experiment with -fp-model fast=2 with the Intel Compiler

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Compiler Support

FP Switches for the Intel Compiler and GCC
-fp-model for the Intel Compiler:
  fast[=1]                optimized for performance (default)
  fast=2                  more aggressive approximations
  precise                 value-safe optimizations only
  source|double|extended  specify the method for expression evaluation; also set -fp-model precise unless explicitly overridden
  except                  enable floating-point exception semantics
  strict                  precise + except + disable fma + don't assume the default floating-point environment
- These switches control reproducibility of exceptions and assumptions about the floating-point environment
- /fp: is the corresponding option prefix on Windows
- -mp, -Op and -IPF-fltacc are deprecated in a future release

Floating-point controls in GCC:
- -f[no-]fast-math is the high-level option; it is off by default (different from the Intel Compiler)
- -Ofast turns on -ffast-math
- -funsafe-math-optimizations turns on reassociation

Value Safety
In SAFE mode, the compiler's hands are tied: all of the following transformations are prohibited.

Optimizations at stake:
- Reassociation
- Flush-to-zero
- Expression evaluation and various mathematical simplifications
- Approximate divide and sqrt
- Math library approximations

Prohibited transformation     Why it is not value-safe
x / x -> 1.0                  x could be 0.0, ∞, or NaN
x - y -> -(y - x)             if x equals y, x - y is +0.0 while -(y - x) is -0.0
x - x -> 0.0                  x could be ∞ or NaN
x * 0.0 -> 0.0                x could be -0.0, ∞, or NaN
x + 0.0 -> x                  x could be -0.0
(x + y) + z -> x + (y + z)    general reassociation is not value-safe
(x == x) -> true              x could be NaN

Most examples are from the C99 standard. fast=2 may include limited-range complex division and limited-range implementations of other math functions on Intel® MIC Architecture.

Floating-Point Behavior
- Floating-point exception flags are set by Intel IMCI
  - Unmasking and trapping are not supported; attempts to unmask will result in a segmentation fault
  - -fp-trap (C) is disabled
  - -fp-model except or strict will yield (slow!) x87 code that supports unmasking and trapping of floating-point exceptions
- Denormals are supported by Intel IMCI
  - Needs -no-ftz or -fp-model precise (like the host)
- 512-bit vector transcendental math functions are available
  - 4 elementary functions are available: RECIP, RSQRT, EXP2, LOG2
  - DIV and SQRT benefit from these 4 functions
  - SVML can even be inlined to avoid function-call overhead
- Many options to select different implementations:
  -fimf-domain-exclusion=, -fimf-precision, -fimf-max-error, ...
  - See "Differences in floating-point arithmetic between Intel® Xeon® processors and the Intel® Xeon Phi™ coprocessor" for details and status
- The IEEE_EXCEPTIONS, IEEE_FEATURES, IEEE_ARITHMETIC modules are not yet functional (Fortran 13.0); hoped for in update 2

Further Information
- Microsoft Visual C++* Floating-Point Optimization: http://msdn2.microsoft.com/en-us/library/aa289157(vs.71).aspx
- The Intel® C++ and Fortran Compiler Documentation, "Floating Point Operations"
- "Consistency of Floating-Point Results using the Intel® Compiler": http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/
- "Differences in Floating-Point Arithmetic between Intel® Xeon® Processors and the Intel® Xeon Phi™ Coprocessor": http://software.intel.com/sites/default/files/article/326703/floating-point-differences-sept11.pdf
- Goldberg, David: "What Every Computer Scientist Should Know About Floating-Point Arithmetic", Computing Surveys, March 1991, p. 203