Intel® Compilers For Xeon™ Processor (www.intel.com/software/products)


Intel® Compilers For Xeon™ Processor

Agenda  General  Xeon™ processor optimizations  Loop level optimizations  Multi-pass optimizations  Other

Agenda  General  Xeon™ processor optimizations  Loop level optimizations  Multi-pass optimizations  Other

General Optimizations
- /Od, -O0: disable optimizations
- /Zi, -g: create symbols
- /O1, -O1: optimize for speed without increasing code size (i.e., disables library function inlining)
- /O2, -O2: optimize for speed (default)
- /O3, -O3: high-level optimizations

Agenda  General  Xeon™ processor optimizations  Loop level optimizations  Multi-pass optimizations  Other

Instruction Scheduling
- Schedule instructions to be optimal for a specific processor's instruction latencies and cache sizes

  Processor                                                     Windows        Linux
  Pentium® processors and Pentium processors with MMX™ technology  -G5         -tpp5
  Pentium Pro, Pentium II and Pentium III processors            -G6 (default)  -tpp6 (default)
  Pentium 4 processor                                           -G7            -tpp7

- Note: the default may change in future compilers

Shift/Multiply Latency
- Pentium
  - Shift has ~1x the latency of an add
  - Multiply has ~10x the latency of an add
- Pentium Pro, II, and III
  - Shift has ~1x the latency of an add
  - Multiply has ~3x the latency of an add
- Pentium 4 (may change in future releases)
  - Shift has ~8x the latency of an add
  - Multiply has ~26x the latency of an add

Under the Covers (P4): the compiler accounts for these differences for you!

    for (int i = 0; i < length; i++) {
        p[i] = q[i] * 32;
    }

    .B1.7:                           # -tpp6
        movl  (%ebx,%edx,4), %eax
        shll  $5, %eax
        movl  %eax, (%esi,%edx,4)
        incl  %edx
        cmpl  %ecx, %edx
        jl    .B1.7

    .B1.7:                           # -tpp7
        movl  (%ebx,%edx,4), %eax
        addl  %eax, %eax
        movl  %eax, (%esi,%edx,4)
        addl  $1, %edx
        cmpl  %ecx, %edx
        jl    .B1.7

Under the Covers: Xeon

Which Processor: -[a]x?

  To require at least...                                          Use  Windows*  Linux*
  Pentium Pro and Pentium II processors with CMOV and
    FCMOV instructions                                            i    -Qaxi     -axi
  Pentium processors with MMX instructions                        M    -QaxM     -axM
  Pentium III processor with Streaming SIMD Extensions
    (implies i and M above)                                       K    -QaxK     -axK
  Pentium 4 processor with Streaming SIMD Extensions 2
    (implies i, M and K above)                                    W    -QaxW     -axW

Automatic Processor Dispatch
- Single executable: a Pentium 4 target that runs on all x86 processors
- For the target processor it uses:
  - processor-specific opcodes
  - prefetch (Pentium III only)
  - vectorization
- Low overhead, though some increase in code size
- Can mix and match: -xK -axW together makes Xeon/Pentium 4 the target and Pentium III the default

Agenda  General  Xeon™ processor optimizations  Loop level optimizations  Multi-pass optimizations  Other

Vectorization
- Automatically converts loops to use MMX/SSE/SSE2 instructions and registers
- Data types: char/short/int/float/double (but not mixed)
- Can use the Short Vector Math Library
- Enabled through -[Q]xW, -[Q]xK, -[Q]axW, -[Q]axK
- -vec_report3 tells you which loops were vectorized, and if not, why not

High Level Optimizer
- Windows: /O3, Linux: -O3
- Use with -xW, -xK, -QxW, -QxK, etc.
  - additional loop optimizations
  - more aggressive dependency analysis
  - scalar replacement
  - software prefetch (-xK on Pentium III)
- Loops must meet criteria related to those for vectorization

Under the Covers: Xeon

SMP Parallelism
- OpenMP
  - Easy multithreading using directives
  - Use KSL tools for development
  - Use Intel tools to optimize for IA in tandem with OpenMP
- Auto-parallelization
  - Simple loops threaded by the compiler alone
  - Loops must meet certain criteria

OpenMP* Support
- OpenMP 1.1 for Fortran and 1.0 for C/C++
  - Debugger info support for OpenMP
  - Assure for Threads supported with the Intel Compiler
- OpenMP switches:
  - -Qopenmp, -openmp (or -openmpP)
  - -QopenmpS, -openmpS (serial, for debugging)
  - -openmp_report[n] (diagnostics)
- Works in conjunction with vectorization

Auto-Parallelization
- Automatic threading of loops without having to manually insert OpenMP* directives
  - -Qparallel (Windows*), -parallel (Linux*)
  - -Qpar_report[n], -par_report[n] (diagnostics)
- Better to use OpenMP directives: the compiler can identify "easy" candidates for parallelization, but large applications are difficult to analyze

Agenda  General and processor optimization  Loop level optimizations  Multi-pass optimizations –Inter Procedural Optimization –Profile Guided Optimization  Other

Inter-Procedural Optimizations (IPO)  -Qip, -ip: Enables interprocedural optimizations for single file compilation.  -Qipo, -ipo: Enables interprocedural optimizations across files.

Inter-Procedural Optimizations (IPO)  More benefits than just inlining –Partial inlining –Interprocedural constant propagation –Passing arguments in registers –Loop-invariant code motion –Dead code elimination –Helps vectorization, memory disambiguation

IPO Usage: 2-Step Process
- Pass 1, compiling (produces virtual .obj and .il files):
    Windows*: icl -c /Qipo main.c func1.c func2.c
    Linux*:   icc -c -ipo main.c func1.c func2.c
- Pass 2, linking (produces the executable):
    Windows*: icl /Qipo main.obj func1.obj func2.obj
    Linux*:   icc -ipo main.obj func1.obj func2.obj
- Windows* hint: LINK=link.exe should be replaced with LINK=xilink.exe, i.e. xilink /Qipo main.obj func1.obj func2.obj

Profile-Guided Optimizations (PGO)
- Use execution-time feedback to guide optimization
- Helps I-cache, paging, branch prediction
- Enabled optimizations:
  - basic block ordering
  - better register allocation
  - better decisions about which functions to inline
  - function ordering
  - switch-statement optimization
  - better vectorization decisions

PGO Usage: 3-Step Process
- Step 1: instrumented compilation (produces an instrumented executable prog.exe)
    Windows: icl /Qprof_gen prog.c
    Linux:   icc -prof_gen prog.c
- Step 2: instrumented execution on a typical dataset; each run produces a .dyn file containing dynamic info
    prog.exe
- Step 3: feedback compilation; the .dyn files are merged into a summary file (.dpi)
    Windows: icl /Qprof_use prog.c
    Linux:   icc -prof_use prog.c
- Delete old .dyn files if you don't want their info included too

When To Use PGO
- Applications with lots of functions, calls, or branching that are not loop-bound
  - Examples: databases, decision support (enterprise), MCAD
  - Apps with computation spread throughout, not confined to kernels
- Considerations:
  - Different paradigm for builds: 3 steps
  - Schedule time in the final stages of development, when the code is more stable
  - Use representative data set(s), not corner cases

Programs That Benefit
- Consistent hot paths
- Many if statements or switches
- Nested if statements or switches
(The original slide's chart placed programs with these traits toward "significant benefit" from PGO, and programs without them toward "little benefit.")

Indirect Branches
- Indirect branches are not as predictable as conditional branches
  - Usually generated for switch statements
  - Have much larger relative latency than direct branches
- The Intel Compiler optimizes likely cases to use conditional branches

Under the Covers: P4

Agenda  General and processor optimization  Loop level optimizations  Multi-pass optimizations  Other –Float point precision –Math Libraries –Other

Floating Point Precision

  Windows        Linux         Description
  -Op            -mp           Strict ANSI C and IEEE 754 floating point (subset of -Za/-ansi)
  -Za            -Xc           Strict ANSI C and IEEE 754
  -Qlong_double  -long_double  long double = 80 bits, not the default of 64
  -Qprec*        -mp1          Precision closer to, but not quite, ANSI; faster than ANSI
  -Qprec_div*    -prec_div*    Turn off conversion of division into reciprocal multiply
  -Qpc{n}*       -pc{n}*       Round to n-bit precision, n = {32, 64, 80}
  -Qrcd*         -rcd*         Remove code that truncates during float-to-integer conversions

  * Only available on IA-32

Math Libraries
- Intel's LIBM (libimf on Linux)
- Short Vector Math Library (SVML)
  - Used when vectorizing loops that have math functions in them
- Automatically used when needed
  - LIB (Windows), LD_LIBRARY_PATH (Linux) environment variables
- Common math functions: sin/cos/tan/exp/sqrt/log, etc.
- Processor dispatch for every IA processor

Libraries on Linux
- -i_dynamic: link to shared libraries (default)
- -static: link to static libraries
- -shared: create a shared object
- -Vaxlib: link to the portability library

Other Switches
- Pragmas: #pragma ivdep hints to the compiler that a loop's iterations are independent and the loop can be vectorized
- See the Compiler User's Guide and Reference
- icc -help | icl -help
- Intel Developer Forum

Summary
- Presented the major optimization switches of the Intel Compiler:
  - general switches
  - vectorization and high-level optimizations
  - profile-guided optimizations
  - interprocedural optimizations
- Explained how the Intel Compiler takes advantage of current IA
- Optimized POV-Ray using the Intel Compiler