Getting Started with Automatic Compiler Vectorization


David Apostal, UND CSci 532 Guest Lecture, Sept 14, 2017

Parallelism Is Key to Performance

Types of parallelism:
- Task-based (MPI)
- Threads (OpenMP, pthreads)
- Vector processing

Scalar Processing

The processor uses one element from each array on each loop iteration.

    #define N 32
    float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

[Diagram: registers holding one element each of b[i], c[i], and the result a[i]]

Vector Processing

The compiler unrolls the loop body a number of times, and the processor performs the same operation on several scalars simultaneously.

    #define N 32
    float a[N], b[N], c[N];

    for (int i = 0; i < N; i = i + 4) {
        a[i]   = b[i]   + c[i];
        a[i+1] = b[i+1] + c[i+1];
        a[i+2] = b[i+2] + c[i+2];
        a[i+3] = b[i+3] + c[i+3];
    }

[Diagram: vector registers holding b[i..i+3] and c[i..i+3], added in one operation to produce a[i..i+3]]
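For illustration, here is roughly what the vectorized loop does, written by hand with SSE intrinsics (a sketch assuming an x86 target with SSE; the function name add_arrays is a hypothetical wrapper, and this intrinsics-level approach is one of the techniques on the next slide):

    #include <xmmintrin.h>  // SSE intrinsics

    #define N 32
    float a[N], b[N], c[N];

    void add_arrays(void) {
        for (int i = 0; i < N; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);  // load 4 floats from b
            __m128 vc = _mm_loadu_ps(&c[i]);  // load 4 floats from c
            __m128 va = _mm_add_ps(vb, vc);   // 4 additions in one instruction
            _mm_storeu_ps(&a[i], va);         // store 4 results to a
        }
    }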

Ease of Use / Portability

Techniques, ordered from greatest ease of use and portability to greatest programmer control:
- Automatic compiler vectorization
- Compiler hints (directives / #pragmas)
- SIMD / vector intrinsics
- Assembler instructions
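As an example of the "compiler hints" level, a directive can assert that a loop is safe to vectorize. A minimal sketch (the function scale is hypothetical; #pragma omp simd needs an OpenMP-SIMD-aware compiler such as icpc with -qopenmp-simd, and Intel's #pragma ivdep is a compiler-specific alternative):

    void scale(float *x, float *y, int n) {
        #pragma omp simd            // hint: iterations are independent
        for (int i = 0; i < n; i++) {
            y[i] = 2.0f * x[i];
        }
    }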

Requirements / Guidelines

- Compiler optimization -O2 or higher
- Loop trip count known at runtime
- A single entry and a single exit
- Straight-line code
- Inner-most loop
- No function calls

The sketch below contrasts a loop that follows these guidelines with one that does not.
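A sketch under these assumptions (process is a hypothetical externally defined function, used only to illustrate an opaque call):

    extern float process(float x);  // hypothetical external function

    void examples(float *a, const float *b, float s, int n) {
        // Follows the guidelines: inner-most, countable,
        // straight-line body, no calls.
        for (int i = 0; i < n; i++) {
            a[i] = b[i] * s;
        }

        // Typically rejected: the data-dependent break gives the loop
        // two exits, and the opaque call hides side effects.
        for (int i = 0; i < n; i++) {
            if (a[i] < 0.0f) break;
            a[i] = process(a[i]);
        }
    }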

Manage the Compiler Report

Vectorization is enabled by default.

- -qopt-report-phase=vec
- -qopt-report=[1-5] (provides increasing levels of detail)
- -qopt-report-routine:func1,func2

Exercise: Test.cpp

    #include <math.h>

    void foo(float *theta, float *sth) {
        for (int i = 0; i < 128; i++) {
            sth[i] = sin(theta[i] + 3.1415927);
        }
    }

Compile with:

    $ icpc -qopt-report-phase=vec \
           -qopt-report=5 \
           -qopt-report-routine:foo \
           -c Test.cpp

Test.optrpt (1)

    Begin optimization report for: foo(float *, float *)

    Report from: Vector optimizations [vec]

    LOOP BEGIN at T1.cpp(3,5)
    <Multiversioned v1>
       ...
       remark #15300: LOOP WAS VECTORIZED
    LOOP END

    <Multiversioned v2>
       remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning

Good: version v1 of the loop was vectorized. Bad: the compiler also had to emit a scalar version v2, because it cannot prove that theta and sth do not alias.

Aliasing

Add a command-line option to tell the compiler that function arguments are not aliased. This results in a single vectorized version of the loop.

    $ icpc -qopt-report-phase=vec \
           -qopt-report=5 \
           -qopt-report-routine:foo \
           -c -fargument-noalias \
           Test.cpp
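The same promise can be made in the source instead of on the command line. A sketch using the __restrict__ qualifier (icpc also accepts the C99-style restrict keyword in C++ when given the -restrict option):

    #include <math.h>

    // __restrict__ promises that theta and sth never overlap, so the
    // compiler can emit a single vectorized version of the loop.
    void foo(float *__restrict__ theta, float *__restrict__ sth) {
        for (int i = 0; i < 128; i++) {
            sth[i] = sin(theta[i] + 3.1415927);
        }
    }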

Test.optrpt (2)

    Begin optimization report for: foo(float *, float *)

    Report from: Vector optimizations [vec]

    LOOP BEGIN at T1.cpp(3,5)
       remark #15389: vectorization support: reference theta has unaligned access   [ T1.cpp(4,18) ]
       remark #15389: vectorization support: reference sth has unaligned access   [ T1.cpp(4,9) ]
       remark #15381: vectorization support: unaligned access used inside loop body
       remark #15305: vectorization support: vector length 4
       remark #15309: vectorization support: normalized vectorization overhead 0.081
       remark #15417: vectorization support: number of FP up converts: single precision to double precision 1   [ T1.cpp(4,18) ]
       remark #15418: vectorization support: number of FP down converts: double precision to single precision 1   [ T1.cpp(4,9) ]
       remark #15300: LOOP WAS VECTORIZED
       remark #15450: unmasked unaligned unit stride loads: 1
       remark #15451: unmasked unaligned unit stride stores: 1
       remark #15475: --- begin vector loop cost summary ---
       remark #15488: --- end vector loop cost summary ---
    LOOP END

Alignment

Tell the compiler that the arrays are aligned.

    #include <math.h>

    void foo(float *theta, float *sth) {
        __assume_aligned(theta, 32);
        __assume_aligned(sth, 32);
        for (int i = 0; i < 128; i++) {
            sth[i] = sin(theta[i] + 3.1415927);
        }
    }
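Note that __assume_aligned is only an assertion; the caller must actually supply 32-byte-aligned memory, or the program's behavior is undefined. A sketch of a hypothetical caller using _mm_malloc (C11 aligned_alloc or posix_memalign would also work):

    #include <xmmintrin.h>  // _mm_malloc / _mm_free

    void foo(float *theta, float *sth);  // the exercise routine above

    int main(void) {
        // Allocate 32-byte-aligned buffers so the promise made by
        // __assume_aligned actually holds at runtime.
        float *theta = (float *)_mm_malloc(128 * sizeof(float), 32);
        float *sth   = (float *)_mm_malloc(128 * sizeof(float), 32);
        for (int i = 0; i < 128; i++) theta[i] = (float)i;
        foo(theta, sth);
        _mm_free(theta);
        _mm_free(sth);
        return 0;
    }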

Test.optrpt (3)

    LOOP BEGIN at T1.cpp(3,5)
       remark #15388: vectorization support: reference theta has aligned access   [ T1.cpp(6,18) ]
       remark #15388: vectorization support: reference sth has aligned access   [ T1.cpp(6,9) ]
       remark #15305: vectorization support: vector length 8
       remark #15309: vectorization support: normalized vectorization overhead 0.006
       remark #15417: vectorization support: number of FP up converts: single precision to double precision 1   [ T1.cpp(6,18) ]
       remark #15418: vectorization support: number of FP down converts: double precision to single precision 1   [ T1.cpp(6,9) ]
       remark #15300: LOOP WAS VECTORIZED
       remark #15448: unmasked aligned unit stride loads: 1
       remark #15449: unmasked aligned unit stride stores: 1
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 113
       remark #15477: vector loop cost: 20.000
       remark #15478: estimated potential speedup: 5.640
       remark #15482: vectorized math library calls: 1
       remark #15487: type converts: 2
       remark #15488: --- end vector loop cost summary ---
    LOOP END

Avoid Type Conversions

Specify float (f) on the function call and the literal value.

    #include <math.h>

    void foo(float *theta, float *sth) {
        __assume_aligned(theta, 32);
        __assume_aligned(sth, 32);
        for (int i = 0; i < 128; i++) {
            sth[i] = sinf(theta[i] + 3.1415927f);
        }
    }

T1.optrpt (4)

    LOOP BEGIN at T1.cpp(5,5)
       remark #15388: vectorization support: reference theta has aligned access   [ T1.cpp(6,18) ]
       remark #15388: vectorization support: reference sth has aligned access   [ T1.cpp(6,9) ]
       remark #15305: vectorization support: vector length 8
       remark #15309: vectorization support: normalized vectorization overhead 0.013
       remark #15300: LOOP WAS VECTORIZED
       remark #15448: unmasked aligned unit stride loads: 1
       remark #15449: unmasked aligned unit stride stores: 1
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 110
       remark #15477: vector loop cost: 9.870
       remark #15478: estimated potential speedup: 11.130
       remark #15482: vectorized math library calls: 1
       remark #15488: --- end vector loop cost summary ---
    LOOP END

Resources

- https://software.intel.com/en-us/articles/vectorization-diagnostics-for-intelr-c-compiler-150-and-above
- https://software.intel.com/en-us/articles/vectorization-essential
- https://software.intel.com/sites/default/files/8c/a9/CompilerAutovectorizationGuide.pdf
- https://software.intel.com/en-us/articles/get-a-helping-hand-from-the-vectorization-advisor
- https://software.intel.com/en-us/mkl
- https://software.intel.com/en-us/intel-parallel-universe-magazine