Parallel Computing Explained: Scalar Tuning
Slides prepared from the CI-Tutor courses at NCSA by S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University, March 2009.

Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 4.1 Aggressive Compiler Options 4.2 Compiler Optimizations 4.3 Vendor Tuned Code 4.4 Further Information

Scalar Tuning If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime. This chapter describes many of these techniques:
The use of the most aggressive compiler options
The use of loop unrolling
The use of subroutine inlining
The use of vendor-supplied tuned code
The detection of cache problems and their solution are presented in the Cache Tuning chapter.

Aggressive Compiler Options For the SGI Origin2000 and the Linux clusters, the main optimization switch is -On, where n ranges from 0 to 3.
-O0 turns off all optimizations.
-O1 and -O2 do beneficial optimizations that will not affect the accuracy of results.
-O3 specifies the most aggressive optimizations; it takes the most compile time, may produce changes in accuracy, and turns on software pipelining.
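As a minimal sketch (the file name prog.f is hypothetical), the optimization level is simply chosen on the compile line, and it is good practice to check the -O3 result against a lower level:
Compile with safe optimizations: f90 -O2 prog.f -o prog
Compile with the most aggressive optimizations: f90 -O3 prog.f -o prog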

Aggressive Compiler Options It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes. It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization. On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n=1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to the IEEE standard at different levels. On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.
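For example, a sketch of such compile lines (prog.f is a hypothetical file name; the flags are the ones named above):
On the SGI Origin2000: f90 -O3 -OPT:IEEE_arithmetic=2 prog.f
On the Linux clusters: ifort -O3 -mp1 prog.f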

Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 4.1 Aggressive Compiler Options 4.2 Compiler Optimizations Statement Level Block Level Routine Level Software Pipelining Loop Unrolling Subroutine Inlining Optimization Report Profile-guided Optimization (PGO) 4.3 Vendor Tuned Code 4.4 Further Information

Compiler Optimizations The various compiler optimizations can be classified as follows:
Statement Level Optimizations
Block Level Optimizations
Routine Level Optimizations
Software Pipelining
Loop Unrolling
Subroutine Inlining
Each of these is described in the following sections.

Statement Level
Constant Folding: Replace simple arithmetic operations on constants with the pre-computed result. y = 5 + 7 becomes y = 12
Short Circuiting: Avoid executing parts of conditional tests that are not necessary. In if (I .eq. J .or. I .eq. K) expression, when I = J the expression is computed immediately, without evaluating the second test.
Register Assignment: Put frequently used variables in registers.
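A minimal, runnable Fortran sketch of the first two transformations (the program and variable names are hypothetical; the comments describe what the compiler does with the written form):

program statement_level_demo
   implicit none
   integer :: I, J, K, y, total
   I = 1; J = 1; K = 2; total = 0
   ! Constant folding: the compiler replaces 5 + 7 with 12 at compile time.
   y = 5 + 7
   ! Short circuiting: when I .eq. J is true, the test against K
   ! does not need to be evaluated before the body is executed.
   if (I .eq. J .or. I .eq. K) then
      total = total + 1
   end if
   print *, y, total
end program statement_level_demo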

Block Level
Dead Code Elimination: Remove unreachable code and code that is never executed or used.
Instruction Scheduling: Reorder the instructions to improve memory pipelining.
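A small runnable Fortran sketch of code that dead code elimination can remove (the names are hypothetical):

program block_level_demo
   implicit none
   integer :: n, unused
   n = 10
   ! Never used afterwards, so this assignment is dead and can be removed.
   unused = n * n
   print *, n
   stop
   ! Unreachable: nothing after the STOP is ever executed.
   print *, 'never printed'
end program block_level_demo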

Routine Level
Strength Reduction: Replace expressions in a loop with an expression that takes fewer cycles.
Common Subexpression Elimination: Expressions that appear more than once are computed once, and the result is substituted for each occurrence of the expression.
Constant Propagation: Compile-time replacement of variables with constants.
Loop Invariant Elimination: Expressions inside a loop that don't change with the do-loop index are moved outside the loop.
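A before/after Fortran sketch of loop-invariant elimination and common subexpression elimination (the array and scalar names are hypothetical; the second loop shows what the compiler effectively generates from the first and computes the same values):

program routine_level_demo
   implicit none
   integer, parameter :: n = 5
   real :: a(n), b(n), c(n), d(n)
   real :: x, y, s, t
   integer :: i
   a = 1.0; b = 2.0; x = 3.0; y = 4.0
   ! As written: x*y is loop invariant, and a(i) + b(i) is a
   ! common subexpression computed twice per iteration.
   do i = 1, n
      c(i) = (a(i) + b(i)) * x * y
      d(i) = (a(i) + b(i)) + x * y
   enddo
   ! What the compiler effectively generates after the optimizations:
   t = x * y
   do i = 1, n
      s = a(i) + b(i)
      c(i) = s * t
      d(i) = s + t
   enddo
   print *, c(1), d(1)
end program routine_level_demo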

Software Pipelining Software pipelining allows the mixing of operations from different loop iterations in each iteration of the hardware loop. It is used to get the maximum work done per clock cycle. Note: On the R10000s there is out-of-order execution of instructions, and software pipelining may actually get in the way of this feature.

Loop Unrolling The loop's stride (or step) value is increased, and the body of the loop is replicated. It is used to improve the scheduling of the loop by giving a longer sequence of straight-line code. An example of loop unrolling follows:
Original Loop:
do I = 1, 99
   c(I) = a(I) + b(I)
enddo
Unrolled Loop:
do I = 1, 99, 3
   c(I) = a(I) + b(I)
   c(I+1) = a(I+1) + b(I+1)
   c(I+2) = a(I+2) + b(I+2)
enddo
There is a limit to the amount of unrolling that can take place because there is a limited number of registers. On the SGI Origin2000, loops are unrolled to a level of 8 by default. You can unroll to a level of 12 by specifying: f90 -O3 -OPT:unroll_times_max=12 ... prog.f On the IA32 Linux cluster, the corresponding flags are -unroll and -unroll0 for unrolling and no unrolling, respectively.

Subroutine Inlining Subroutine inlining replaces a call to a subroutine with the body of the subroutine itself. One reason for using subroutine inlining is that when a subroutine is called inside a do loop that has a huge iteration count, inlining may be more efficient because it cuts down on the call overhead inside the loop. However, the chief reason for using it is that do loops that contain subroutine calls may not parallelize.
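A minimal Fortran sketch of the idea (the subroutine axpy_one and the array names are hypothetical; the commented loop shows the form the compiler effectively generates after inlining):

program inlining_demo
   implicit none
   integer, parameter :: n = 100
   real :: x(n), y(n)
   integer :: i
   x = 1.0; y = 2.0
   ! As written: one call per iteration; a loop like this may not parallelize.
   do i = 1, n
      call axpy_one(y(i), x(i))
   enddo
   ! After inlining, the compiler effectively generates:
   !    do i = 1, n
   !       y(i) = y(i) + 2.0 * x(i)
   !    enddo
   print *, y(1)
contains
   subroutine axpy_one(yi, xi)
      real, intent(inout) :: yi
      real, intent(in) :: xi
      yi = yi + 2.0 * xi
   end subroutine axpy_one
end program inlining_demo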

Subroutine Inlining On the SGI Origin2000 computer, there are several options to invoke inlining:
Inline all routines except those specified to -INLINE:never: f90 -O3 -INLINE:all … prog.f
Inline no routines except those specified to -INLINE:must: f90 -O3 -INLINE:none … prog.f
Specify a list of routines to inline at every call: f90 -O3 -INLINE:must=subrname … prog.f
Specify a list of routines never to inline: f90 -O3 -INLINE:never=subrname … prog.f
On the Linux clusters, the following flags can invoke function inlining:
-ip: inline function expansion for calls defined within the current source file
-ipo: inline function expansion for calls defined in separate files

Optimization Report Intel 9.x and later compilers can generate reports that provide useful information on the optimization done on different parts of your code. To generate such optimization reports in a file filename, add the flag -opt-report-file filename. If you have a lot of source files to process simultaneously, and you use a makefile to compile, you can also use make's "suffix" rules to have optimization reports produced automatically, each with a unique name. For example,
.f.o:
        ifort -c -o $*.o $(FFLAGS) -opt-report-file $*.opt $*.f
creates optimization reports that are named identically to the original Fortran source but with the suffix ".f" replaced by ".opt".

Optimization Report To help developers and performance analysts navigate through the usually lengthy optimization reports, the NCSA program OptView is designed to provide an easy-to-use and intuitive interface that allows the user to browse through their own source code, cross-referenced with the optimization reports. OptView is installed on NCSA's IA64 Linux cluster under the directory /usr/apps/tools/bin. You can either add that directory to your UNIX PATH or invoke optview using an absolute path name. You'll need to be using the X Window System and to have set your DISPLAY environment variable correctly for OptView to work. OptView can provide a quick overview of which loops in a source code, or in source codes spread among multiple files, are highly optimized and which might need further work. For a detailed description of the use of OptView, see:

Profile-guided Optimization (PGO) Profile-guided optimization allows Intel compilers to use valuable runtime information to make better decisions about function inlining and interprocedural optimizations to generate faster code. Its methodology is illustrated as follows:

Profile-guided Optimization (PGO) First, you do an instrumented compilation by adding the -prof-gen flag in the compile process:
icc -prof-gen -c a1.c a2.c a3.c
icc a1.o a2.o a3.o -lirc
Then, you run the program with a representative set of data to generate the dynamic information files, indicated by the .dyn suffix. These files contain valuable runtime information for the compiler to do better function inlining and other optimizations. Finally, the code is recompiled with the -prof-use flag to use the runtime information:
icc -prof-use -ipo -c a1.c a2.c a3.c
A profile-guided optimized executable is generated.
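A sketch of the middle step (the executable name a.out and the data file train.dat are hypothetical): the instrumented binary is simply run on representative input, which leaves the .dyn files behind for the recompile.
./a.out < train.dat
ls *.dyn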

Vendor Tuned Code Vendor math libraries contain code that is optimized for the vendor's specific machine. On the SGI Origin2000 platform, Complib.sgimath and SCSL are available. On the Linux clusters, Intel MKL is available. Ways to link to these libraries are described in Section 3 - Porting Issues.

Further Information
SGI IRIX man and www pages:
man opt
man lno
man inline
man ipa
man perfex
Performance Tuning for the Origin2000 (SGI online documentation)
Linux clusters help and www pages:
ifort/icc/icpc -help
Intel and Intel64 compiler documentation pages