Published by Harriet Barton. Modified over 9 years ago.
1
Parallel Computing Explained: Scalar Tuning
Slides prepared from the CI-Tutor courses at NCSA (http://ci-tutor.ncsa.uiuc.edu/)
By S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University
March 2009
2
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
  4.1 Aggressive Compiler Options
  4.2 Compiler Optimizations
  4.3 Vendor Tuned Code
  4.4 Further Information
3
Scalar Tuning
If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime. This chapter describes several such techniques:
  The use of the most aggressive compiler options
  The improvement of loop unrolling
  The use of subroutine inlining
  The use of vendor-supplied tuned code
The detection of cache problems and their solutions are presented in the Cache Tuning chapter.
4
Aggressive Compiler Options
For the SGI Origin2000 and the Linux clusters, the main optimization switch is -On, where n ranges from 0 to 3.
  -O0 turns off all optimizations.
  -O1 and -O2 perform beneficial optimizations that will not affect the accuracy of results.
  -O3 specifies the most aggressive optimizations. It takes the most compile time, may change the accuracy of results, and turns on software pipelining.
5
Aggressive Compiler Options
Note that -O3 might carry out loop transformations that produce incorrect results in some codes. It is recommended that you compare the answer obtained with -O3 against one obtained at a lower optimization level.
To enforce conformance to the IEEE arithmetic standard at different levels, -O3 can be combined with -OPT:IEEE_arithmetic=n (n = 1, 2, or 3) on the SGI Origin2000 and with -mp (or -mp1) on the Linux clusters.
On the SGI Origin2000, the option -Ofast=ip27 is also available. It specifies the most aggressive optimizations, tuned specifically for the Origin2000 computer.
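The recommended comparison between an -O3 run and a lower-level run can be automated. The following is a minimal sketch in C, assuming both runs have written their results into arrays of doubles; the function names max_rel_diff and results_agree and the choice of tolerance are illustrative and do not come from the course material.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* Largest difference between two result arrays of length n, measured
   relative to the reference value (absolute when |ref[i]| < 1).
   `ref` would hold results from the lower-level build and `opt` the
   results from the -O3 build; both names are hypothetical. */
double max_rel_diff(const double *ref, const double *opt, size_t n)
{
    double worst = 0.0;
    if (n == 0) {
        return worst;
    }
    assert(ref != NULL && opt != NULL);
    for (size_t i = 0; i < n; ++i) {
        double scale = abs_d(ref[i]) > 1.0 ? abs_d(ref[i]) : 1.0;
        double d = abs_d(ref[i] - opt[i]) / scale;
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}

/* Returns 1 when the two runs agree to within `tol`, 0 otherwise. */
int results_agree(const double *ref, const double *opt, size_t n, double tol)
{
    return max_rel_diff(ref, opt, n) <= tol;
}
```

A suitable tolerance depends on the code being checked. Differences near machine precision are usually harmless reordering effects, while large differences suggest that a transformation changed the result.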
6
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
  4.1 Aggressive Compiler Options
  4.2 Compiler Optimizations
    4.2.1 Statement Level
    4.2.2 Block Level
    4.2.3 Routine Level
    4.2.4 Software Pipelining
    4.2.5 Loop Unrolling
    4.2.6 Subroutine Inlining
    4.2.7 Optimization Report
    4.2.8 Profile-guided Optimization (PGO)
  4.3 Vendor Tuned Code
  4.4 Further Information
7
Compiler Optimizations
The various compiler optimizations can be classified as follows:
  Statement Level Optimizations
  Block Level Optimizations
  Routine Level Optimizations
  Software Pipelining
  Loop Unrolling
  Subroutine Inlining
Each of these is described in the following sections.
8
Statement Level
Constant Folding
  Replace simple arithmetic operations on constants with the precomputed result.
  y = 5+7 becomes y = 12
Short Circuiting
  Avoid executing parts of conditional tests that are not necessary.
  if (I.eq.J .or. I.eq.K) expression
  When I = J, the second test is skipped and the expression is computed immediately.
Register Assignment
  Put frequently used variables in registers.
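The first two statement-level optimizations can be observed from C. This is a sketch only. The helper equals_counted and its counter are hypothetical instrumentation added to make the skipped test visible. One difference from the Fortran example is worth noting: C defines || to skip its right operand when the left is true, whereas a Fortran compiler may short-circuit .or. but is not required to.

```c
#include <assert.h>

/* Constant folding: 5 + 7 is a constant expression, so the compiler
   can evaluate it at compile time. static_assert confirms the folded
   value without running any code. */
static_assert(5 + 7 == 12, "5 + 7 folds to 12 at compile time");

int folded(void)
{
    int y = 5 + 7;   /* typically emitted as y = 12 */
    return y;
}

/* Counts how often the second comparison is actually evaluated. */
static int second_test_calls = 0;

static int equals_counted(int a, int b)
{
    ++second_test_calls;
    return a == b;
}

/* Short circuiting: the C analogue of if (I.eq.J .or. I.eq.K).
   When i == j the right-hand test is never executed. */
int matches_either(int i, int j, int k)
{
    return (i == j) || equals_counted(i, k);
}
```

Calling matches_either(3, 3, 9) leaves second_test_calls unchanged, while matches_either(3, 4, 3) increments it once.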
9
Block Level
Dead Code Elimination
  Remove unreachable code and code that is never executed or used.
Instruction Scheduling
  Reorder the instructions to improve memory pipelining.
10
Routine Level
Strength Reduction
  Replace expressions in a loop with an expression that takes fewer cycles.
Common Subexpression Elimination
  Expressions that appear more than once are computed once, and the result is substituted for each occurrence of the expression.
Constant Propagation
  Compile-time replacement of variables with constants.
Loop Invariant Elimination
  Expressions inside a loop that do not change with the do loop index are moved outside the loop.
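Three of these routine-level transformations can be applied by hand to show what the optimizer is aiming for. The sketch below is illustrative: the functions fill_naive and fill_tuned and the expression they compute are invented for this example, and a real compiler decides for itself which rewrites are safe.

```c
#include <assert.h>
#include <stddef.h>

/* As written: the multiply 4 * i, the repeated (x + y), and the
   loop-invariant scale * scale are all recomputed on every iteration. */
void fill_naive(double *out, size_t n, double x, double y, double scale)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = (double)(4 * i) + (x + y) * (x + y) + scale * scale;
    }
}

/* Hand-applied equivalents of the optimizations described above. */
void fill_tuned(double *out, size_t n, double x, double y, double scale)
{
    if (n == 0) {
        return;
    }
    assert(out != NULL);
    const double s = x + y;           /* common subexpression computed once */
    const double sq = s * s;          /* loop invariant moved out of the loop */
    const double sc = scale * scale;  /* loop invariant moved out of the loop */
    size_t k = 0;                     /* strength reduction: 4 * i becomes a running sum */
    for (size_t i = 0; i < n; ++i) {
        out[i] = (double)k + sq + sc; /* same evaluation order as fill_naive */
        k += 4;
    }
}
```

The tuned version keeps the additions in the same order as the original so that both functions produce identical floating-point results.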
11
Software Pipelining
Software pipelining mixes operations from different loop iterations in each iteration of the hardware loop. It is used to get the maximum work done per clock cycle.
Note: The R10000 processors execute instructions out of order, and software pipelining may actually get in the way of this feature.
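The idea of mixing operations from different iterations can be imitated at source level. The following hand-pipelined loop is a conceptual sketch only: a compiler performs this transformation on machine instructions, not on C source, and the functions scale_plain and scale_pipelined are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: the load, multiply, and store in each trip
   all belong to the same iteration i. */
void scale_plain(double *c, const double *a, double s, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * s;
    }
}

/* Hand-pipelined form: each trip multiplies and stores the value for
   iteration i while already loading the operand for iteration i + 1. */
void scale_pipelined(double *c, const double *a, double s, size_t n)
{
    if (n == 0) {
        return;
    }
    assert(c != NULL && a != NULL);
    double loaded = a[0];                /* prologue: first load */
    for (size_t i = 0; i + 1 < n; ++i) {
        double product = loaded * s;     /* compute for iteration i */
        loaded = a[i + 1];               /* load for iteration i + 1 */
        c[i] = product;                  /* store for iteration i */
    }
    c[n - 1] = loaded * s;               /* epilogue: drain the last value */
}
```

The prologue and epilogue are the cost of the technique: they fill and drain the pipeline so that the steady-state loop body always has independent work from two iterations available.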
12
Loop Unrolling
The loop's stride (or step) value is increased, and the body of the loop is replicated. Unrolling improves the scheduling of the loop by giving a longer sequence of straight-line code. An example of loop unrolling follows:
Original Loop
    do I = 1, 99
      c(I) = a(I) + b(I)
    enddo
Unrolled Loop
    do I = 1, 99, 3
      c(I) = a(I) + b(I)
      c(I+1) = a(I+1) + b(I+1)
      c(I+2) = a(I+2) + b(I+2)
    enddo
There is a limit to the amount of unrolling that can take place because the number of registers is limited. On the SGI Origin2000, loops are unrolled to a level of 8 by default. You can unroll to a level of 12 by specifying:
    f90 -O3 -OPT:unroll_times_max=12... prog.f
On the IA32 Linux cluster, the corresponding flags are -unroll and -unroll0, for unrolling and no unrolling, respectively.
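The Fortran example works without any cleanup because 99 is a multiple of 3. For a general trip count, an unrolled loop needs a remainder loop to handle the leftover iterations. The C sketch below shows this under that assumption; the function names add_rolled and add_unrolled3 are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element per trip. */
void add_rolled(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* Unrolled by 3: the stride becomes 3 and the body is replicated.
   A short remainder loop handles the last n % 3 elements. */
void add_unrolled3(double *c, const double *a, const double *b, size_t n)
{
    if (n == 0) {
        return;
    }
    assert(c != NULL && a != NULL && b != NULL);
    size_t i = 0;
    for (; i + 2 < n; i += 3) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
    }
    for (; i < n; ++i) {   /* remainder iterations */
        c[i] = a[i] + b[i];
    }
}
```

In practice it is usually better to let the compiler unroll through its flags, since it knows how many registers are available on the target machine.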
13
Subroutine Inlining
Subroutine inlining replaces a call to a subroutine with the body of the subroutine itself. One reason for using it is that when a subroutine is called inside a do loop with a huge iteration count, inlining may be more efficient because it eliminates the overhead of the call on every iteration. However, the chief reason for using it is that do loops that contain subroutine calls may not parallelize.
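The effect of inlining on a loop can be pictured with a small C sketch. The routine axpy_elem and the two loop functions are hypothetical names; the second function shows by hand what the first one looks like once the compiler has substituted the routine body for the call.

```c
#include <assert.h>
#include <stddef.h>

/* Small routine called once per loop iteration. In C, `static inline`
   is only a hint; the compiler decides whether to inline. */
static inline double axpy_elem(double alpha, double x, double y)
{
    return alpha * x + y;
}

/* Loop containing a call: without inlining, every iteration pays for
   argument passing, the jump to the routine, and the return. */
void axpy_with_call(double *y, const double *x, double alpha, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = axpy_elem(alpha, x[i], y[i]);
    }
}

/* The same loop after inlining: the call is replaced by the routine
   body, leaving a plain loop that is easier to analyze. */
void axpy_inlined(double *y, const double *x, double alpha, size_t n)
{
    if (n == 0) {
        return;
    }
    assert(y != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

With the call gone, the loop body consists only of array accesses and arithmetic, which is the form a parallelizing compiler can reason about.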
14
Subroutine Inlining
On the SGI Origin2000 computer, there are several options to invoke inlining:
  Inline all routines except those specified to -INLINE:never
    f90 -O3 -INLINE:all … prog.f
  Inline no routines except those specified to -INLINE:must
    f90 -O3 -INLINE:none … prog.f
  Specify a list of routines to inline at every call
    f90 -O3 -INLINE:must=subrname … prog.f
  Specify a list of routines never to inline
    f90 -O3 -INLINE:never=subrname … prog.f
On the Linux clusters, the following flags can invoke function inlining:
  -ip: inline function expansion for calls defined within the current source file
  -ipo: inline function expansion for calls defined in separate files
15
Optimization Report
Intel 9.x and later compilers can generate reports that provide useful information on the optimizations done on different parts of your code. To write such a report to a file filename, add the flag -opt-report-file filename.
If you have many source files to process and you use a makefile to compile, you can also use make's "suffix" rules to have optimization reports produced automatically, each with a unique name. For example,
    .f.o:
            ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f
creates optimization reports that are named identically to the original Fortran source but with the suffix ".f" replaced by ".opt".
16
Optimization Report
To help developers and performance analysts navigate the usually lengthy optimization reports, the NCSA program OptView provides an easy-to-use, intuitive interface that lets users browse their own source code, cross-referenced with the optimization reports.
OptView is installed on NCSA's IA64 Linux cluster under the directory /usr/apps/tools/bin. You can either add that directory to your UNIX PATH or invoke optview using an absolute path name. You need to be using the X Window System and to have set your DISPLAY environment variable correctly for OptView to work.
OptView can provide a quick overview of which loops in one or more source files are highly optimized and which might need further work. For a detailed description of how to use OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/
17
Profile-guided Optimization (PGO) Profile-guided optimization allows Intel compilers to use valuable runtime information to make better decisions about function inlining and interprocedural optimizations to generate faster codes. Its methodology is illustrated as follows:
18
Profile-guided Optimization (PGO)
First, do an instrumented compilation by adding the -prof-gen flag to the compile step:
    icc -prof-gen -c a1.c a2.c a3.c
    icc a1.o a2.o a3.o -lirc
Then, run the program with a representative set of data to generate the dynamic information files, which have the .dyn suffix. These files contain valuable runtime information that lets the compiler do better function inlining and other optimizations.
Finally, recompile the code with the -prof-use flag to use the runtime information:
    icc -prof-use -ipo -c a1.c a2.c a3.c
A profile-guided optimized executable is generated.
19
Vendor Tuned Code Vendor math libraries have codes that are optimized for their specific machine. On the SGI Origin2000 platform, Complib.sgimath and SCSL are available. On the Linux clusters, Intel MKL is available. Ways to link to these libraries are described in Section 3 - Porting Issues.
20
Further Information
SGI IRIX man and www pages
  man opt
  man lno
  man inline
  man ipa
  man perfex
  Performance Tuning for the Origin2000 at http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
Linux clusters help and www pages
  ifort/icc/icpc -help (Intel)
  http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
  http://perfsuite.ncsa.uiuc.edu/OptView/