Template Library for Vector Loops A presentation of P0075 and P0076

Template Library for Vector Loops
A presentation of P0075 and P0076
Pablo Halpern, Intel Corp
2015-01-13

Overview
- What are the goals of vector and parallel extensions?
- An overview of the Parallelism TS
- Summary of proposed index-based loops (P0075)
  - Library syntax for vector and parallel loops based on indexes, not iterators
  - Support for arbitrary reductions and inductions
- Description of proposed vector execution policies (P0076)
  - Range of vector architectures supported
  - Wavefront execution: how vector execution differs from thread parallelism
  - The difference between the unseq and vec execution policies

What are the goals of vector and parallel extensions?
- Efficient exploitation of modern parallel hardware
  - Multicore processors
  - Vector (SIMD) units
  - GPUs and other coprocessors
- Conformance to the style and tradition of modern C++
- Friendly to programmers already familiar with other parallel-programming systems
- Reasonable conformance to thread and vector progress assumptions

Summary of the Parallelism TS (N4507)
- A collection of algorithms that can be executed in parallel using one of a set of parallel execution policies.
- The parallel execution policies defined in N4507 are:
  - sequential_execution_policy (seq): no parallelism
  - parallel_execution_policy (par): thread-based parallelism
  - parallel_vector_execution_policy (par_vec): same as par, but with restricted synchronization, allowing use of SIMD vector units
- Notably absent: vector_execution_policy (vec), which would guarantee vector order of evaluation.

Example:

    parallel::for_each(parallel::par, v.begin(), v.end(),
                       [&](double& x){ f(x, 9.5); });

Index-based loops (P0075): overview

P0075:

    for_loop(par, 0, n, [&](int i){
        A[i] = f(B[i], C[2*i]);
    });

OpenMP equivalent:

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        A[i] = f(B[i], C[2*i]);
    }
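P0075's for_loop never shipped as a standard library facility, so its shape can be made concrete with a small serial emulation. The policy tags and function below are illustrative stand-ins modeled on the proposal, not the proposal's actual implementation; a real implementation would dispatch on the policy and distribute iterations across threads or vector lanes.

```cpp
#include <cassert>

// Hypothetical stand-ins for the proposal's execution-policy tags.
struct seq_t {}; inline constexpr seq_t seq{};
struct par_t {}; inline constexpr par_t par{};

// Serial emulation of P0075's index-based for_loop: applies body to
// first, first+1, ..., last-1. A real implementation would inspect the
// policy and parallelize or vectorize accordingly.
template <class Policy, class Body>
void for_loop(Policy, int first, int last, Body body) {
    for (int i = first; i < last; ++i)
        body(i);
}

// Usage in the shape of the slide's loop, with addition standing in
// for the slide's f(B[i], C[2*i]).
inline void stencil(double* A, const double* B, const double* C, int n) {
    for_loop(par, 0, n, [&](int i) {
        A[i] = B[i] + C[2 * i];
    });
}
```

The key point of the index-based interface survives even in this sketch: the loop is expressed over integer indexes, not iterator pairs.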

Strided loops and flow control

P0075:

    for_loop_strided(par, n, 0, -2, [&](int i){
        if (B[i] < 0)
            return;               // return from lambda
        A[i] = f(B[i], C[2*i]);
    });

OpenMP equivalent:

    #pragma omp parallel for
    for (int i = n; i > 0; i -= 2) {
        if (B[i] < 0)
            continue;
        A[i] = f(B[i], C[2*i]);
    }

Induction variables

P0075:

    int j = 0;
    for_loop(par, 0, n, induction(j, 2),
             [&](int i, int jv){    // jv could reuse the name "j"
        A[i] = f(B[i], C[jv]);
    });
    assert(j == 2*n);

OpenMP equivalent:

    int j = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i, j += 2) {
        A[i] = f(B[i], C[j]);
    }
    assert(j == 2*n);
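The observable behavior of the induction helper can be modeled serially: iteration i sees the value the variable would have after i serial strides, and after the loop the variable holds its final value. The function name and shape below are illustrative, not the proposal's actual interface.

```cpp
#include <cassert>

// Serial model of the observable behavior of P0075's induction(j, 2):
// iteration i of the loop receives var + stride*i, and after the loop
// the induction variable holds its final value, just as the slide's
// assert(j == 2*n) requires.
template <class T, class Body>
void for_loop_induction(int first, int last, T& var, T stride, Body body) {
    for (int i = first; i < last; ++i)
        body(i, static_cast<T>(var + stride * (i - first)));
    var += stride * (last - first);   // final value of the induction variable
}
```

Because the per-iteration value is computed from the index rather than carried across iterations, a parallel implementation can hand each iteration its value independently.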

Reductions

P0075:

    float sum = 0.0;
    for_loop(par, 0, n, reduction_plus(sum),
             [&](int i, float& sum) {   // local (race-free) partial sum;
                                        // reuses the name "sum"
        sum += f(B[i], C[2*i]);
    });

OpenMP equivalent:

    float sum = 0.0;
    #pragma omp parallel for \
            reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += f(B[i], C[2*i]);     // local (race-free) partial sum
    }
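The semantics of the reduction_plus helper can likewise be sketched serially: the body receives a privatized, race-free partial sum, and the partials are combined into the caller's variable when the loop completes. Serially there is only one partial, but the observable result matches the parallel case. The function name below is an assumption, not the proposal's API.

```cpp
#include <cassert>

// Serial model of for_loop(..., reduction_plus(sum), ...): the body
// mutates a private partial accumulator; the partial is folded into
// the caller's variable after the loop finishes.
template <class T, class Body>
void for_loop_reduce_plus(int first, int last, T& sum, Body body) {
    T partial = T{};                  // private, race-free accumulator
    for (int i = first; i < last; ++i)
        body(i, partial);             // body adds into the partial sum
    sum += partial;                   // combine into the caller's sum
}
```

This is exactly the shape the OpenMP `reduction(+:sum)` clause generates per thread, made explicit in library form.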

User-defined Reductions

P0075:

    MyType accum;
    constexpr MyType ident{…};
    MyType op(MyType, MyType);

    for_loop(par, 0, n,
             reduction(accum, ident,  // identity value
                       op),           // reduction operation
             [&](int i, auto& accum){
        accum = op(accum, f(B[i]));
    });

OpenMP equivalent:

    MyType accum;
    constexpr MyType ident{…};
    MyType op(MyType, MyType);

    #pragma omp declare reduction(rop : MyType : \
            omp_out = op(omp_out, omp_in)) \
            initializer(omp_priv = ident)
    #pragma omp parallel for \
            reduction(rop : accum)
    for (int i = 0; i < n; ++i) {
        accum = op(accum, f(B[i]));
    }
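The general reduction takes an explicit identity and combining operation. A serial sketch of the observable semantics (illustrative name and shape, assuming the behavior described in P0075): each worker would seed a private partial with the identity and partials would be folded together with op; serially there is one partial, and the final combine into accum is the same either way.

```cpp
#include <cassert>

// Serial model of P0075's general reduction(accum, identity, op).
template <class T, class Op, class Body>
void for_loop_reduce(int first, int last, T& accum, T identity, Op op, Body body) {
    T partial = identity;             // private accumulator, seeded with identity
    for (int i = first; i < last; ++i)
        body(i, partial);             // body combines into the partial
    accum = op(accum, partial);       // fold the partial into the caller's value
}
```

For the combine to be deterministic under parallel execution, op should be associative and identity should be a true identity for op, mirroring the requirements OpenMP places on `declare reduction`.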

Overview of vector execution policies (P0076)

Policies: seq, par, unseq, par_vec, vec

- unsequenced_execution_policy (unseq):
  - relaxed sequencing
  - applicable to STL algorithms
- vector_execution_policy (vec):
  - necessary conditions for classic vector loop execution
  - applicable to for_loop and for_loop_strided
  - allows vectorization of loops with certain dependence patterns
- Both policies use a single OS thread, letting applications avoid disturbing existing threading.

Vector architectures: "long vector" machines (e.g., Cray-1, CDC STAR-100)
- Step A(i+1) can begin at or before the end of step A(i).
[Diagram: pipelined steps A(i), B(i), C(i), D(i) for i = 0..3, staggered in time]

Vector architectures: SIMD (x86 AVX & SSE, ARM NEON, Power AltiVec)
- Steps A(i) and A(i+1) execute concurrently in fixed-width registers.
[Diagram: steps A(i) through D(i) for i = 0..3, grouped into SIMD registers]

Vector architectures: software pipelining
- The compiler orders instructions to maximize use of the CPU pipeline and minimize latency.
[Diagram: steps B(i), C(i), D(i) interleaved across iterations]

Wavefront Application (sequencing for vec)
- The for_loop template applies a function to a sequence of arguments.
- All of the preceding vector architectures execute instructions in a predictable wavefront: no earlier application may fall behind a later application.
- This enables exploitation of "forward dependencies".
- It makes vector_execution_policy safe to use on any loop that can be auto-vectorized.
- The phrasing of the rules in P0076R0 is complete but complex; a simplification is being investigated.
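The practical payoff of wavefront sequencing is that a loop may read an element a later iteration will write (a forward dependence) and still be a candidate for vec, because the read front never falls behind the write front. A vec loop must produce the same result as plain sequential execution; a minimal sketch of such a loop (illustrative function, not from the proposal):

```cpp
#include <cassert>

// A loop with a forward dependence: iteration i reads u[i+1], which
// iteration i+1 will overwrite. In wavefront (or plain sequential)
// order, every read of u[i+1] still sees the original value, so the
// loop has a well-defined result, even though it would be unsafe
// under fully unsequenced execution.
inline void shift_scale(double* u, int n, double a) {
    for (int i = 0; i < n - 1; ++i)
        u[i] = u[i + 1] * a;          // read stays ahead of the write front
}
```

Under unseq, iteration i+1's write to u[i+1] could be reordered before iteration i's read, corrupting the result; wavefront order rules that out.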

Wavefront for "long vector" machines
[Diagram: applications A(i) through D(i) advancing as a wavefront over time]

Wavefront for SIMD machines (x86 AVX & SSE, ARM NEON, Power AltiVec)
[Diagram: SIMD groups of applications A(i) through D(i) advancing as a wavefront over time]

Wavefront for software pipelining
[Diagram: pipelined applications A(i) through D(i) advancing as a wavefront over time]

vec Covers the Gap Between seq and unseq

Loops that work with unseq semantics:

    for_loop(unseq, 1, N, [&](int i) {
        V[i] = U[i]*A;
        U[i] = V[i]+B;
    });

Loops that work with vector (vec) semantics:

    for_loop(vec, 1, N, [&](int i) {
        V[i] = U[i+1]*A;
        U[i] = V[i-1]+B;
    });

Loops requiring sequential (seq) execution:

    for_loop(seq, 1, N, [&](int i) {
        V[i] = U[i-1]*A;
        U[i] = V[i+1]+B;
    });

Without vec, the middle loop would either have to run as seq, with the programmer hoping that the auto-vectorizer kicks in, or be fissioned into two loops, paying bandwidth overheads.

vec_off

vec_off invokes its argument, but sequenced as if the entire invocation were one big instruction:

    extern int* p;
    for_loop(vec, 0, n, [&](int i) {
        y[i] += y[i+1];
        if (y[i] < 0) {
            vec_off([&]{ *p++ = i; });   // capture by reference so i and p are visible
        }
    });

Vendor Extension via Subclassing

    struct my_policy : vector_execution_policy {
        static const int safelen = 8;
        static const bool vectorize_remainder = true;
    };

    for_loop(my_policy(), 0, 1912, [&](int i) {
        Z[i+8] = Z[i]*A;
    });

OpenMP equivalent (without vectorize_remainder):

    #pragma omp simd safelen(8)
    for (int i = 0; i < 1912; ++i) {
        Z[i+8] = Z[i]*A;
    }

- The compiler can find these compile-time values knowing just the type of the policy; no interprocedural analysis is required.
- vectorize_remainder is an extension not available in OpenMP.
- The scheme is extensible: vendors could specify additional members that their compilers would recognize, and other compilers would simply ignore them.
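The claim that the compiler can find these values "knowing just the type of the policy" can be modeled in ordinary C++: a trait queries Policy::safelen when it exists and falls back to a default otherwise, so unknown vendor members are simply ignored. The trait below is illustrative, not machinery from the proposal.

```cpp
#include <cassert>
#include <type_traits>

struct vector_execution_policy {};    // stand-in base policy

// Vendor subclass in the style of the slide.
struct my_policy : vector_execution_policy {
    static constexpr int safelen = 8;
    static constexpr bool vectorize_remainder = true;
};

// Primary template: no safelen member, so report 0 ("no limit declared").
template <class Policy, class = void>
struct policy_safelen {
    static constexpr int value = 0;
};

// Specialization chosen, via SFINAE, only when Policy::safelen exists.
// This is a purely type-based query: no interprocedural analysis needed.
template <class Policy>
struct policy_safelen<Policy, std::void_t<decltype(Policy::safelen)>> {
    static constexpr int value = Policy::safelen;
};
```

A compiler implementing the proposal would perform the analogous lookup internally; the library-level trait just shows why subclassing composes cleanly across vendors.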

Possible Future Directions
- Algorithms with certain dependence patterns do not prevent vectorization of enclosing algorithms and, depending on the target architecture, may themselves be vectorized (which may or may not be profitable).
- These vector algorithms are not part of the current proposal; they are future work, consistent with the current proposal.

    // Histogram
    a[b[i]]++;

    // Compress / expand
    if (cond(i)) { a[i] = b[i] * c[j++]; }

Why vec Only For for_loop?
- The semantics of vec execution are well-defined only for loops.
- We are not yet sure how to specify them for algorithms; this is a possible area for future work.
- It is not clear that vec has a useful meaning for STL algorithms.
- Nonetheless, it is extremely valuable for for_loop and for_loop_strided.

Summary

Policies: par, unseq, par_vec, vec

- unseq_execution_policy:
  - relaxed sequencing
  - applicable to STL algorithms
- vec_execution_policy:
  - necessary conditions for classic vector loop execution
  - applicable to for_loop and for_loop_strided
- Both policies use a single OS thread, letting applications avoid disturbing existing threading.

Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. 
Notice revision #20110804

Alternatives (from N4238)
- Lock-step model:
  - Not consistent with seq fallback
- Explicit ordering-point model:
  - Warts grew: explicit temporaries and helper functions proliferated
  - Seemed to increase the difficulty of vector programming
- Why mess with decades of success?

Examples of the Complications

Instead of writing:

    A[i] = 2*A[i+1];

the programmer would have to write either:

    auto tmp = A[i + 1];
    parallel::wavefront_ordering_pt();
    A[i] = 2*tmp;

or:

    A[i] = 2*parallel::wavefront_rvalue(A[i + 1]);

Instead of writing:

    A[B[i]] = expr;

the programmer would have to write either:

    auto tmp = expr;
    auto& ref = A[B[i]];
    parallel::wavefront_off([&]{ ref = tmp; });

or:

    parallel::wavefront_assign(A[B[i]]) = expr;