Multi-core Programming Tools

2 Intel Compilers 9.x on the Intel® Core Duo™ Processor, Windows version. Topics: General Ideas – Compiler Switches – Dual Core – Vectorization

3 General Ideas – Optimization. Exploiting architectural power requires sophisticated compilers: optimal use of – Registers & functional units – Dual-Core/Multi-processor – SSE instructions – Cache architecture

4 General Ideas – Compatibility Compatibility is a key concern for software development. Always read manuals to ensure: Tool compatibility with hardware revisions. Tool compatibility with software revisions. Tool compatibility with IDE. Tool compatibility with native (development) and target (deployment) operating system.

5 General Ideas – Intel C++ Compatibility with Microsoft. Source & binary compatible with VC 2003 under /Qvc71; source & binary compatible with VC 2005 under /Qvc8. Microsoft* and Intel OpenMP binaries are not compatible: use a single compiler for all modules compiled with OpenMP. For more information, refer to the User's Guide.

6 General Ideas – Tools. Never ignore or take tools for granted. A key part of system development is the initial specification and qualification of the tools required to get the job done. The wrong tool alone can destroy a project's chances of success.

7 General Ideas – Use the Intel C++ Compiler in the Microsoft IDE

8 Topics: General Ideas – Compiler Switches – Dual Core – Vectorization

9 Compiler Switches – General Optimizations

Windows*   Linux*/Mac*   Effect
/Od        -O0           Disables optimizations
/Zi        -g            Creates symbols
/O1        -O1           Optimizes for binary size (server code)
/O2        -O2           Optimizes for speed (default)
/O3        -O3           Optimizes for the data cache (loopy floating-point code)

10 Compiler Switches – Multi-pass Optimization: Interprocedural Optimizations (IPO)
ip: enables interprocedural optimizations for single-file compilation
ipo: enables interprocedural optimizations across files – can inline functions in separate files – enhances optimization when used in combination with other compiler features

Windows*   Linux*/Mac*
/Qip       -ip
/Qipo      -ipo

11 Compiler Switches – Multi-pass Optimization: IPO Usage, a Two-Step Process

Pass 1 – Compiling (produces "virtual" .o object files):
  Windows*: icl -c /Qipo main.c func1.c func2.c
  Linux*:   icc -c -ipo main.c func1.c func2.c
  Mac*:     icc -c -ipo main.c func1.c func2.c

Pass 2 – Linking (produces the executable):
  Windows*: icl /Qipo main.o func1.o func2.o
  Linux*:   icc -ipo main.o func1.o func2.o
  Mac*:     icc -ipo main.o func1.o func2.o

12 Compiler Switches – Profile-Guided Optimization (PGO). Uses execution-time feedback to guide many other compiler optimizations; helps the I-cache, paging, and branch prediction. Enabled optimizations: – Basic block ordering – Better register allocation – Better decisions about which functions to inline – Function ordering – Switch-statement optimization – Better vectorization decisions
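
The kind of code that benefits is easy to sketch. In this illustrative function (not from the deck), profile feedback from a typical run tells the compiler the error branch is almost never taken, so it can lay the hot in-range path out contiguously for the I-cache and branch predictor:

```c
/* Hypothetical PGO candidate: on typical data, v[i] > limit is rare,
   so profile feedback lets the compiler treat the increment as the
   cold path. The function itself is ordinary C. */
int clamp_count(const int *v, int n, int limit)
{
    int clamped = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] > limit)   /* rarely taken on representative input */
            clamped++;
    }
    return clamped;
}
```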

13 Compiler Switches – Multi-pass Optimization, PGO: Three-Step Process

Step 1 – Instrumented compilation (produces an instrumented executable):
  Mac*/Linux*: icc -prof_gen[x] prog.c
  Windows*:    icl -Qprof_gen[x] prog.c
Step 2 – Instrumented execution: run the program on a typical dataset. Each run produces a .dyn file containing dynamic info; delete old .dyn files if you do not want their info included.
Step 3 – Feedback compilation (uses the merged .dpi summary file):
  Mac*/Linux*: icc -prof_use prog.c
  Windows*:    icl -Qprof_use prog.c

14 Topics: General Ideas – Compiler Switches – Dual Core (Auto-Parallelization, OpenMP, Threading Diagnostics) – Vectorization

15 Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives. The compiler can identify "easy" candidates for parallelization, but large applications are difficult to analyze.

Windows*          Linux*/Mac*
/Qparallel        -parallel
/Qpar_report[n]   -par_report[n]
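
An "easy" candidate looks like the following sketch (an illustrative function, not from the deck): every iteration writes a distinct element and reads nothing written by any other iteration, so the auto-parallelizer can split the index range across threads.

```c
#include <stddef.h>

/* A loop the auto-parallelizer (/Qparallel or -parallel) can thread:
   iterations are fully independent, with no loop-carried reads or
   writes, and the trip count n is known on entry. */
void vector_scale(double *c, const double *a, const double *b,
                  double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = k * a[i] + b[i];
}
```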

16 OpenMP* Threading Technology: a pragma-based approach to parallelism. Usage: – OpenMP switch: -openmp (Linux*/Mac*) or /Qopenmp (Windows*) – OpenMP reports: -openmp-report or /Qopenmp-report

#pragma omp parallel for
for (i = 0; i < MAX; i++)
    A[i] = c * A[i] + B[i];
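
The slide's loop can be made self-contained as below. A useful property of the pragma approach: built without the OpenMP switch, the pragma is simply ignored and the loop runs serially with the same result.

```c
#include <stddef.h>

/* Self-contained version of the slide's loop. With /Qopenmp (Intel) or
   -fopenmp (GCC/Clang) the iterations are divided among threads; this
   is safe because each iteration touches only its own A[i]. */
void scale_add(float *A, const float *B, float c, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        A[i] = c * A[i] + B[i];
}
```

The loop variable is a signed `long` because OpenMP of this era required a signed integer loop index.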

17 OpenMP: Workqueuing Extension Example. The Intel compiler's workqueuing extension creates a queue of tasks; it works on recursive functions, linked lists, etc.

#pragma intel omp parallel taskq shared(p)
{
    while (p != NULL) {
        #pragma intel omp task captureprivate(p)
        do_work1(p);
        p = p->next;
    }
}
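
The taskq pragmas above are an Intel-specific extension of that era; OpenMP 3.0 later standardized the same idea as task/taskwait. A sketch of the standard form for walking a linked list (the node type and `do_work1` body are illustrative assumptions):

```c
#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
} node;

/* hypothetical per-node work: here, just double the payload */
static void do_work1(node *p) { p->value *= 2; }

void process_list(node *head)
{
    #pragma omp parallel
    #pragma omp single
    for (node *p = head; p != NULL; p = p->next) {
        /* one thread walks the list; each node becomes a task that
           any thread in the team may execute */
        #pragma omp task firstprivate(p)
        do_work1(p);
    }
    /* implicit barrier at the end of the parallel region */
}
```

Without an OpenMP switch the pragmas are ignored and the list is processed serially, which gives the same result.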

18 Parallel Diagnostics: source instrumentation for Intel Thread Checker, allowing Thread Checker to diagnose threading-correctness bugs. To use -tcheck / /Qtcheck you must have Intel Thread Checker installed.

Windows*   Linux*    Mac*
/Qtcheck   -tcheck   No support

19 Topics: General Ideas – Compiler Switches – Dual Core – Vectorization (SSE & Vectorization, Vectorization Reports, Explanations of a few specific vectorization inhibitors)

20 SIMD – SSE, SSE2, SSE3 Support. [Diagram: packed data types in a 128-bit register – 16x bytes, 8x words, 4x dwords, 2x qwords, 1x dqword, 4x floats, 2x doubles – introduced across MMX*, SSE, SSE2, and SSE3.] * MMX actually used the x87 floating-point registers; SSE, SSE2, and SSE3 use the new SSE registers.

21 SSE3 Instructions
– SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
– Thread synchronization: MONITOR, MWAIT
– Video encoding: LDDQU
– Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
– FP-to-integer conversion: FISTTP
* Also benefits complex arithmetic and vectorization

22 Using SSE3 – Your Task: Convert This… [Diagram: scalar execution in 128-bit registers – A[0]+B[0]→C[0], then A[1]+B[1]→C[1], with the upper lanes unused.]

for (i = 0; i < MAX; i++)
    c[i] = a[i] + b[i];

23 … Into This … [Diagram: packed SIMD execution in 128-bit registers – A[3..0]+B[3..0]→C[3..0], four additions per instruction.]

for (i = 0; i < MAX; i++)
    c[i] = a[i] + b[i];
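
The transformation the compiler performs can be sketched in portable C as a strip-mined loop: a main loop handling four elements per trip (one 128-bit register's worth of floats, each group corresponding to one SIMD instruction) plus a scalar remainder loop for trailing elements.

```c
#include <stddef.h>

/* Plain-C sketch of what the vectorizer generates: process elements in
   groups of four (modelling one packed SSE addition per group), then
   finish any leftover elements one at a time. */
void add_arrays(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      /* one "SIMD" trip: 4 adds */
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)                /* scalar remainder loop */
        c[i] = a[i] + b[i];
}
```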

24 Compiler Based Vectorization – Processor Specific
– Generate instructions and optimize for Intel® Pentium® 4 compatible processors, including MMX, SSE and SSE2: Windows* /QxW; Linux* -xW; Mac* does not apply.
– Generate instructions and optimize for Intel® processors with SSE3 capability, including Core Duo (these processors support SSE3 as well as MMX, SSE and SSE2): Windows* /QxP, /QaxP; Linux* -xP, -axP; Mac* vectorization occurs by default.

25 Compiler Based Vectorization – Automatic Processor Dispatch: ax[?]. Produces a single executable optimized for Intel® Core Duo processors plus generic code that runs on all IA-32 processors. For each target processor it uses: – processor-specific instructions – vectorization. Low overhead; some increase in code size.

26 Why Loops Don't Vectorize – Independence. Loop iterations generally must be independent. Some relevant qualifiers: – Some dependent loops can be vectorized. – Most function calls cannot be vectorized. – Some conditional branches prevent vectorization. – Loops must be countable. – The outer loop of a nest cannot be vectorized. – Mixed data types cannot be vectorized.

27 Why Didn't My Loop Vectorize? Set the diagnostic level dumped to stdout with /Qvec_reportn (Windows*) or -vec_reportn (Linux*, Mac*): – n=0: no diagnostic information – n=1: (default) loops successfully vectorized – n=2: loops not vectorized, and the reason why not – n=3: adds dependency information – n=4: reports only non-vectorized loops – n=5: reports only non-vectorized loops and adds dependency info

28 Why Loops Don't Vectorize – common report messages: – "Existence of vector dependence" – "Nonunit stride used" – "Mixed Data Types" – "Unsupported Loop Structure" – "Contains unvectorizable statement at line XX" – There are more reasons loops don't vectorize, but we will discuss the ones above.

29 "Existence of Vector Dependency" usually indicates a real dependency between iterations of the loop, as shown here:

for (i = 0; i < 100; i++)
    x[i] = A * x[i + 1];

30 Defining Loop Independence. Iteration Y of a loop is independent of when (or whether) iteration X occurs:

int a[MAX], b[MAX];
for (j = 0; j < MAX; j++) {
    a[j] = b[j];
}

31 "Nonunit stride used"

for (I = 0; I <= MAX; I++)
    for (J = 0; J <= MAX; J++) {
        c[I][J] += 1;                   // unit stride
        c[J][I] += 1;                   // non-unit stride
        A[J*J]  += 1;                   // non-unit stride
        A[B[J]] += 1;                   // non-unit stride
        if (A[MAX-J] == 1) last1 = J;   // non-unit stride
    }

End result: loading a vector may take more cycles than executing the operation sequentially.
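
For the `c[J][I]` case there is a classic fix: interchange the loops so the innermost index walks memory contiguously. A sketch (illustrative functions, with a fixed size N assumed for simplicity):

```c
#define N 64

/* Inner loop walks down a column: consecutive accesses are N ints
   apart in memory (non-unit stride). */
void inc_colmajor(int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[j][i] += 1;
}

/* After loop interchange the inner loop walks along a row: unit
   stride, which is both cache- and vectorizer-friendly. */
void inc_rowmajor(int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] += 1;
}
```

Both versions compute the same result; only the memory traversal order differs.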

32 "Mixed Data Types"

An example:

int howmany_close(double *x, double *y)
{
    int withinborder = 0;
    double dist;
    for (int i = 0; i < MAX; i++) {
        dist = sqrtf(x[i]*x[i] + y[i]*y[i]);   // float sqrtf on double data
        if (dist < 5) withinborder++;
    }
    return withinborder;
}

Mixed data types are possible but complicate things, e.g. 2 doubles vs. 4 ints per SIMD register; some operations with specific data types won't work.

33 "Unsupported Loop Structure"

Example:

struct _xx {
    int data;
    int bound;
};

doit1(int *a, struct _xx *x)
{
    for (int i = 0; i < x->bound; i++)
        a[i] = 0;
}

An unsupported loop structure means the loop is not countable, or the compiler for whatever reason can't construct a run-time expression for the trip count.
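
One likely reason the loop is not countable here: `a` might alias the struct that `x` points to, so the compiler cannot prove the bound stays constant while the loop stores through `a`. Reading the bound into a local before the loop makes the trip count a run-time constant. A sketch:

```c
struct _xx {
    int data;
    int bound;
};

/* Countable variant: n is loop-invariant by construction, so the
   compiler can compute the trip count before entering the loop. */
void doit1(int *a, struct _xx *x)
{
    int n = x->bound;          /* local, invariant copy of the bound */
    for (int i = 0; i < n; i++)
        a[i] = 0;
}
```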

34 "Contains unvectorizable statement": an opaque function call cannot be applied lane-by-lane to a SIMD register.

for (i = 1; i < nx; i++) {
    B[i] = func(A[i]);
}

[Diagram: 128-bit registers – func would have to be applied to each lane of A[3..0] to produce B[3..0].]

35 Reference – White papers and technical notes – Product support resources