Getting Reproducible Results with Intel® MKL 11.0
Todd Rosenquist, Technical Consulting Engineer, Intel® Math Kernel Library

The agenda
- Reproducible results in Intel MKL
- The symptom
- The problem
- The reality
- The requirements
- A conditional solution
- A beginner's guide
- Performance
- Further resources
- Try the feature in the recently released Intel® MKL 11.0
Speaker note (voice-over): mention the Composer and Studio XE 2013 versions.

Ever seen something like this?
    C:\Users\me>test.exe
    4.012345678901111
    4.012345678902222
Speaker note: mention "repeatable" results.

…or this?
    Intel® Xeon® Processor E5540
    C:\Users\me>test.exe
    4.012345678901111

    Intel® Xeon® Processor E3-1275
    C:\Users\me>test.exe
    4.012345678902222
Speaker note: mention "reproducibility" (as opposed to repeatability), then mention that we tend to discuss both in general as reproducibility.

Why do results vary?
Root cause for variations in results:
- floating-point numbers: the order of computation matters!
- double precision example where (a+b)+c ≠ a+(b+c):
    2^-63 + 1 + (-1) = 2^-63     (infinitely precise result)
    (2^-63 + 1) + (-1) = 0       (correct IEEE double precision result)
    2^-63 + (1 + (-1)) = 2^-63   (correct IEEE double precision result)
Order matters when doing floating point arithmetic.

Why does the order of operations change in Intel MKL?
Optimizations:
- instruction sets: memory alignment affects grouping of data in registers
- multiple cores / multiple processors: most functions are threaded to use as many cores as will give good scalability
- non-deterministic task scheduling: some algorithms use asynchronous task scheduling for optimal performance
- code path: optimized to use all the processor features available on the system where the program is run
Many optimizations require a change in order of operations.

Why are reproducible results important for Intel MKL users?
- Technical/legacy: Software correctness is determined by comparison to previous 'gold' results.
- Debugging: When developing and debugging, a higher degree of run-to-run stability is required to find potential problems.
- Legal: Accreditation or approval of software might require exact reproduction of previously defined results.
- Customer perception: Developers may understand the technical issues with reproducibility but still require reproducible results, since end users or customers will be disconcerted by the inconsistencies.
Emphasis: CNR is not about getting better accuracy.
Source: Email correspondence with Kai Diethelm of GNS; see his whitepaper: http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0312/W_CS_TheLimitsofReproducibilityinNumericalSimulation.pdf

Balancing Reproducibility and Performance: Conditional Numerical Reproducibility (CNR)
- Memory alignment
  - Align memory; try the Intel MKL memory allocation functions
  - 64-byte alignment for processors in the next few years
- Number of threads
  - Set the number of threads to a constant number
  - Use sequential libraries
- Deterministic task scheduling (New!)
  - Ensures that floating-point operations occur in the same order from run to run, so that results are reproducible
- Code path control (New!)
  - Maintains consistent code paths across processors
  - Will often mean lower performance on the latest processors
Goal: Achieve the best performance possible for cases that require reproducibility.
Speaker notes: Emphasize how we added the new controls to help customers. Mention the OpenMP control briefly? PGI and GNU OpenMP have this feature?

Why "Conditional"?
In Intel MKL 11.0 reproducibility is currently available under certain conditions:
- Within a single operating system / architecture
  - Reproducibility only applies within the blue boxes, not between them…
  - Reproducibility on all supported servers and workstations
  - No support yet for Intel® Xeon Phi™ coprocessors
- Within a particular version of Intel MKL
  - Results in version 11.0 update 1 may differ from results in version 11.0
- Reproducibility controls in Intel MKL only affect Intel MKL functions
[Figure: blue boxes labeled with operating systems (Linux*, Windows*, Mac OS X) and architectures (IA32, Intel® 64)]
Speaker note: Connect this to customer requests.

Conditions for reproducibility
- Aligned input and output arrays in function calls
  - 16-byte alignment for the family of SSE instruction sets
  - 32-byte alignment for AVX
  - 64-byte alignment for future processors (choose this to be safe)
- Set the same number of computational threads for the library in each run
- Use the same Intel MKL parameters from run-to-run
  - Example: You cannot call a function in 3 blocks in one run and 4 blocks in the next
- Use the new functions & controls to ensure deterministic task scheduling and to control code paths
  - CNR controls must be set or called before any computational math functions in Intel MKL
Speaker notes: TBB does not change block size based on # of threads; this is good for reproducibility. Ask engineering about performance penalties for alignment. ScaLAPACK: need to discuss with engineering. Hans: there's more than deterministic reduction.

Example - COMPATIBLE
For reproducible results on Intel and Intel-compatible CPUs supporting SSE2 instructions or later:
- function call: mkl_cbwr_set(MKL_CBWR_COMPATIBLE)
- or environment variable: set MKL_CBWR="COMPATIBLE"
Note: MKL_CBWR_COMPATIBLE is provided because Intel and Intel-compatible CPUs have approximation instructions (e.g., rcpps/rsqrtps) that may return different results. This option ensures that Intel MKL uses an SSE2-only code path that does not contain any of these instructions.

Example - SSE2
For the same results on every Intel processor that supports SSE2 instructions or later:
- function call: mkl_cbwr_set(MKL_CBWR_SSE2)
- or environment variable: set MKL_CBWR="SSE2"
Note: on non-Intel processors the results may differ, since only the MKL_CBWR_COMPATIBLE path is supported.

Example - SSE4.2
For the same results on every Intel processor that supports SSE4.2 instructions or later:
- function call: mkl_cbwr_set(MKL_CBWR_SSE4_2)
- or environment variable: set MKL_CBWR="SSE4_2"
Note: on non-Intel processors the results may differ, since only the MKL_CBWR_COMPATIBLE path is supported.

Example - deterministic task scheduling
For consistent results on all supported processors without fixing the code branch:
- function call: mkl_cbwr_set(MKL_CBWR_AUTO)
- or environment variable: set MKL_CBWR="AUTO"
Note:
- This will ensure deterministic task scheduling
- It will not give you reproducibility from processor to processor

Example - Find out the best performing option from a pool of processors
For the best option given a pool of computing resources in a grid setting, you may launch a simple program as follows:

    #include <stdio.h>
    #include <mkl.h>

    int main(void) {
        int my_cbwr_branch;

        /* Find the available MKL_CBWR_BRANCH on this processor */
        my_cbwr_branch = mkl_cbwr_get_auto_branch();

        /* mkl_cbwr_set returns MKL_CBWR_SUCCESS when the branch was set */
        if (mkl_cbwr_set(my_cbwr_branch) != MKL_CBWR_SUCCESS) {
            printf("Error in setting branch. Aborting…\n");
            return 1;  /* 1 is not one of the branch codes listed below */
        }

        /* Report the branch code through the exit status */
        return my_cbwr_branch;
    }

Examine all results and use mkl_cbwr_set(<minimum_result>).
The full list of options:
    COMPATIBLE   3
    SSE2         4
    SSE3         5
    SSSE3        6
    SSE4_1       7
    SSE4_2       8
    AVX          9
    AVX2        10

Change this sort of inconsistency…
    C:\Users\me>test.exe
    4.012345678901111
    4.012345678902222

    C:\Users\me>test.exe
    4.012345678901111
- Align memory
- Constant # of threads
- Turn on CNR with either mkl_cbwr_set(MKL_CBWR_AUTO) or set MKL_CBWR=AUTO

Change this inconsistency in results…
    Intel® Xeon® Processor E5540
    C:\Users\me>test.exe
    4.012345678901111

    Intel® Xeon® Processor E3-1275
    C:\Users\me>test.exe
    4.012345678902222

…to get reproducible results?
    Intel® Xeon® Processor E5540 (supporting SSE4.2 instructions)
    C:\Users\me>test.exe
    4.012345678901111

    Intel® Xeon® Processor E3-1275 (supporting AVX instructions)
    C:\Users\me>test.exe
    4.012345678901111
- Align memory
- Constant # of threads
- Turn on CNR with either mkl_cbwr_set(MKL_CBWR_SSE4_2) or set MKL_CBWR=SSE4_2

What's next?
https://softwareproductsurvey.intel.com/survey/150072/1afd/

Further resources on conditional numerical reproducibility
- Intel MKL Documentation (online and in the product)
  - Intel MKL User's Guide
  - Reference Manual
- Knowledgebase articles on CNR
- Support
  - Intel MKL user forum
  - Intel Premier support
- Feedback
  - Survey: https://softwareproductsurvey.intel.com/survey/150072/1afd/

New optimizations and features
- Support for the Intel® Xeon Phi™ coprocessor based on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture), on Linux* only
- Optimizations using the new Intel® Advanced Vector Extensions 2 (AVX2), including the new FMA3 instructions
- FFTs: Completed support for real-to-complex transforms with sizes given by 64-bit integers
- Local threading control function mkl_set_num_threads_local()

- Sept 18th, 2012 9:00AM: Interesting ties between tools and new hardware features: How Intel Tools support the many new features in processors and coprocessors
- Oct 2nd, 2012 9:00AM: Pointer Checker: Catch Out-of-Bounds Memory Accesses Easily!
- Oct 9th, 2012 9:00AM: Achieving better parallel performance of Fortran programs with Intel® VTune™ Amplifier XE profiling
- Oct 16th, 2012 9:00AM: How Intel® Parallel Studio XE is used to improve the HMMER application
- Oct 23rd, 2012 9:00AM: Three common Fortran mistakes you can avoid by using Intel® Inspector XE
- Oct 30th, 2012 9:00AM: Using the Intel® Math Kernel Library 11.0 and Compiler to Obtain Run-to-Run Reproducible Results
- Nov 6th, 2012 9:00AM: Avoid common parallelization mistakes with the help of Intel® Advisor XE
- Dec 4th, 2012 9:00AM: Fortran 2008 Standard Parallel Programming Features in Intel® Fortran Composer XE*
http://software.intel.com/en-us/fall-webinar-series-psxe-and-fsxe

Summary
Conditional Numerical Reproducibility (CNR) provides:
- reproducible results from run-to-run
- reproducible results from processor-to-processor
- the ability to balance reproducibility requirements with great performance
Evaluate CNR in the following:
- Intel® Math Kernel Library 11.0
- Intel® Composer XE 2013
- Intel® Parallel Studio XE 2013
- Intel® Cluster Studio XE 2013
Provide feedback: https://softwareproductsurvey.intel.com/survey/150072/1afd/