Getting Reproducible Results with Intel® MKL 11.0

Presentation transcript:

Getting Reproducible Results with Intel® MKL 11.0 Todd Rosenquist Technical Consulting Engineer Intel® Math Kernel Library

Agenda: reproducible results in Intel MKL
- The symptom
- The problem
- The reality
- The requirements
- A conditional solution
- A beginner's guide
- Performance
- Further resources

Try the feature in the recently released Intel® MKL 11.0.

Speaker note (voice-over): Composer and Studio versions XE 2013.

Ever seen something like this?

    C:\Users\me>test.exe
    4.012345678901111
    4.012345678902222

Speaker note: mention "repeatable" results.

...or this?

    Intel® Xeon® Processor E5540:
        C:\Users\me>test.exe
        4.012345678901111

    Intel® Xeon® Processor E3-1275:
        C:\Users\me>test.exe
        4.012345678902222

Speaker note: mention "reproducibility" (as opposed to repeatability), then mention that we tend to discuss both under the general term reproducibility.

Why do results vary?

Root cause of variations in results: with floating-point numbers, the order of computation matters.

Double precision example where (a + b) + c ≠ a + (b + c):

    2^-63 + 1 + (-1)    = 2^-63   (infinitely precise result)
    (2^-63 + 1) + (-1)  = 0       (correct IEEE double precision result)
    2^-63 + (1 + (-1))  = 2^-63   (correct IEEE double precision result)

In the second line, 2^-63 is far smaller than the spacing between 1 and the next representable double (2^-52), so 2^-63 + 1 rounds to 1 and the small term is lost.

Order matters when doing floating-point arithmetic.

Why does the order of operations change in Intel MKL?

Optimizations:
- Instruction sets: the code path is optimized to use all the processor features available on the system where the program is run.
- Memory alignment: affects the grouping of data in registers.
- Multiple cores / multiple processors: most functions are threaded to use as many cores as will give good scalability.
- Non-deterministic task scheduling: some algorithms use asynchronous task scheduling for optimal performance.

Many optimizations require a change in the order of operations.

Why are reproducible results important for Intel MKL users?
- Technical/legacy: software correctness is determined by comparison to previous 'gold' results.
- Debugging: when developing and debugging, a higher degree of run-to-run stability is required to find potential problems.
- Legal: accreditation or approval of software might require exact reproduction of previously defined results.
- Customer perception: developers may understand the technical issues with reproducibility but still require reproducible results, since end users or customers will be disconcerted by the inconsistencies.

Emphasis: CNR is not about getting better accuracy.

Source: email correspondence with Kai Diethelm of GNS. See his whitepaper: http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0312/W_CS_TheLimitsofReproducibilityinNumericalSimulation.pdf

Balancing Reproducibility and Performance: Conditional Numerical Reproducibility (CNR)

Goal: achieve the best performance possible for cases that require reproducibility.

- Memory alignment: align memory (try the Intel MKL memory allocation functions); use 64-byte alignment for processors in the next few years.
- Number of threads: set the number of threads to a constant number, or use the sequential libraries.
- Deterministic task scheduling: ensures that floating-point operations occur in a fixed order so that results are reproducible.
- Code path control: maintains consistent code paths across processors; this will often mean lower performance on the latest processors.

New in Intel MKL 11.0: the controls for deterministic task scheduling and code paths.

Speaker notes: emphasize how we added the new controls to help customers. Mention the OpenMP control briefly? Do PGI and GNU OpenMP have this feature?

Why "Conditional"?

In Intel MKL 11.0, reproducibility is currently available under certain conditions:
- Within a single operating system / architecture combination
  - [Figure: blue boxes labeled Linux*, IA32, Intel® 64, Windows*, Mac OS X.] Reproducibility only applies within the blue boxes, not between them.
  - Reproducibility on all supported servers and workstations
  - No support yet for Intel® Xeon Phi™ coprocessors
- Within a particular version of Intel MKL
  - Results in version 11.0 update 1 may differ from results in version 11.0.
- Reproducibility controls in Intel MKL only affect Intel MKL functions.

Speaker note: connect this to customer requests.

Conditions for reproducibility
- Align input and output arrays in function calls:
  - 16-byte alignment for the family of SSE instruction sets
  - 32-byte alignment for AVX
  - 64-byte alignment for future processors (choose this to be safe)
- Set the same number of computational threads for the library in each run.
- Use the same Intel MKL parameters from run to run. Example: you cannot call a function in 3 blocks in one run and 4 blocks in the next.
- Use the new functions and controls to ensure deterministic task scheduling and to control code paths.
- CNR controls must be set or called before any computational math functions in Intel MKL.

Speaker notes:
- TBB does not change block size based on the number of threads; this is good for reproducibility.
- Ask engineering about performance penalties for alignment.
- ScaLAPACK: need to discuss with engineering. Hans: there's more than deterministic reduction.

Example - COMPATIBLE

For reproducible results on Intel and Intel-compatible CPUs supporting SSE2 instructions or later:

    function call:         mkl_cbwr_set(MKL_CBWR_COMPATIBLE)
    environment variable:  set MKL_CBWR="COMPATIBLE"

Note: MKL_CBWR_COMPATIBLE is provided because Intel and Intel-compatible CPUs have approximation instructions (e.g., rcpps/rsqrtps) that may return different results. This option ensures that Intel MKL uses an SSE2-only code path that does not contain any of these instructions.

Example - SSE2

For the same results on every Intel processor that supports SSE2 instructions or later:

    function call:         mkl_cbwr_set(MKL_CBWR_SSE2)
    environment variable:  set MKL_CBWR="SSE2"

Note: on non-Intel processors the results may differ, since only the MKL_CBWR_COMPATIBLE path is supported.

Example - SSE4.2

For the same results on every Intel processor that supports SSE4.2 instructions or later:

    function call:         mkl_cbwr_set(MKL_CBWR_SSE4_2)
    environment variable:  set MKL_CBWR="SSE4_2"

Note: on non-Intel processors the results may differ, since only the MKL_CBWR_COMPATIBLE path is supported.

Example - deterministic task scheduling

For consistent run-to-run results on all supported processors without fixing the code branch:

    function call:         mkl_cbwr_set(MKL_CBWR_AUTO)
    environment variable:  set MKL_CBWR="AUTO"

Note:
- This will ensure deterministic task scheduling.
- It will not give you reproducibility from processor to processor.

Example - find the best-performing option for a pool of processors

To find the best option for a pool of computing resources in a grid setting, you may launch a simple program such as the following on each system:

    #include <stdio.h>
    #include <mkl.h>

    int main(void) {
        int my_cbwr_branch;

        /* Find the available MKL_CBWR_BRANCH on this processor. */
        my_cbwr_branch = mkl_cbwr_get_auto_branch();

        /* mkl_cbwr_set returns MKL_CBWR_SUCCESS when the branch is accepted. */
        if (mkl_cbwr_set(my_cbwr_branch) != MKL_CBWR_SUCCESS) {
            printf("Error in setting branch. Aborting...\n");
            return -1; /* not a valid branch value */
        }

        /* Report the branch value as the exit status. */
        return my_cbwr_branch;
    }

Examine all results and use mkl_cbwr_set(<minimum_result>).

The full list of options:

    Option       Value
    COMPATIBLE   3
    SSE2         4
    SSE3         5
    SSSE3        6
    SSE4_1       7
    SSE4_2       8
    AVX          9
    AVX2         10

Change this sort of inconsistency...

    C:\Users\me>test.exe
    4.012345678901111
    4.012345678902222

...to this:

    C:\Users\me>test.exe
    4.012345678901111

- Align memory
- Use a constant number of threads
- Turn on CNR with either mkl_cbwr_set(MKL_CBWR_AUTO) or set MKL_CBWR=AUTO

Change this inconsistency in results...

    Intel® Xeon® Processor E5540:
        C:\Users\me>test.exe
        4.012345678901111

    Intel® Xeon® Processor E3-1275:
        C:\Users\me>test.exe
        4.012345678902222

...to get reproducible results:

    Intel® Xeon® Processor E5540 (supporting SSE4.2 instructions):
        C:\Users\me>test.exe
        4.012345678901111

    Intel® Xeon® Processor E3-1275 (supporting AVX instructions):
        C:\Users\me>test.exe
        4.012345678901111

- Align memory
- Use a constant number of threads
- Turn on CNR with either mkl_cbwr_set(MKL_CBWR_SSE4_2) or set MKL_CBWR=SSE4_2

What's next?

https://softwareproductsurvey.intel.com/survey/150072/1afd/

Further resources on conditional numerical reproducibility
- Intel MKL documentation (online and in the product)
  - Intel MKL User's Guide
  - Reference Manual
- Knowledgebase articles on CNR
- Support
  - Intel MKL user forum
  - Intel Premier support
- Feedback survey: https://softwareproductsurvey.intel.com/survey/150072/1afd/

New optimizations and features
- Support for the Intel® Xeon Phi™ coprocessor based on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture), on Linux* only
- Optimizations using the new Intel® Advanced Vector Extensions 2 (AVX2), including the new FMA3 instructions
- FFTs: completed support for real-to-complex transforms with sizes given by 64-bit integers
- Local threading control function mkl_set_num_threads_local()

- Sept 18th, 2012 9:00AM: Interesting ties between tools and new hardware features: How Intel Tools support the many new features in processors and coprocessors
- Oct 2nd, 2012 9:00AM: Pointer Checker: Catch Out-of-Bounds Memory Accesses Easily!
- Oct 9th, 2012 9:00AM: Achieving better parallel performance of Fortran programs with Intel® VTune™ Amplifier XE profiling
- Oct 16th, 2012 9:00AM: How Intel® Parallel Studio XE is used to improve the HMMER application
- Oct 23rd, 2012 9:00AM: Three common Fortran mistakes you can avoid by using Intel® Inspector XE
- Oct 30th, 2012 9:00AM: Using the Intel® Math Kernel Library 11.0 and Compiler to Obtain Run-to-Run Reproducible Results
- Nov 6th, 2012 9:00AM: Avoid common parallelization mistakes with the help of Intel® Advisor XE
- Dec 4th, 2012 9:00AM: Fortran 2008 Standard Parallel Programming Features in Intel® Fortran Composer XE*

http://software.intel.com/en-us/fall-webinar-series-psxe-and-fsxe

Summary

Conditional Numerical Reproducibility (CNR) provides:
- reproducible results from run to run
- reproducible results from processor to processor
- the ability to balance reproducibility requirements with great performance

Evaluate CNR in the following:
- Intel® Math Kernel Library 11.0
- Intel® Composer XE 2013
- Intel® Parallel Studio XE 2013
- Intel® Cluster Studio XE 2013

Provide feedback: https://softwareproductsurvey.intel.com/survey/150072/1afd/