The Jacquard Programming Environment Mike Stewart NUG User Training, 10/3/05.


1 The Jacquard Programming Environment
Mike Stewart
NUG User Training, 10/3/05

2 Outline
Compiling and linking.
Optimization.
Libraries.
Debugging.
Porting from Seaborg and other systems.

3 Pathscale Compilers
Default compilers: Pathscale Fortran 90, C, and C++.
The "path" module is loaded by default and points to the current default version of the Pathscale compilers (currently 2.2.1).
Other versions are available: module avail path.
Extensive vendor documentation is available online at http://pathscale.com/docs.html
Commercial product: well supported and optimized.

4 Compiling Code
Compiler invocations:
–No MPI: pathf90, pathcc, pathCC.
–MPI: mpif90, mpicc, mpicxx.
The MPI compiler wrappers use the currently loaded compiler version.
The MPI and non-MPI compiler invocations take the same options and arguments.
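For example, serial and MPI builds look like this (the source file names here are hypothetical):

```shell
# Serial builds with the Pathscale compilers:
pathf90 -o hello hello.f90      # Fortran 90
pathcc  -o hello hello.c        # C

# MPI builds use the wrapper scripts, which accept the same options:
mpif90 -o mpi_hello mpi_hello.f90
mpicc  -o mpi_hello mpi_hello.c
```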

5 Compiler Optimization Options
4 numeric levels -On, where n ranges from 0 (no optimization) to 3.
Default level: -O2 (unlike IBM).
-g without a -O option changes the default to -O0.

6 -O1 Optimization
Minimal impact on compilation time compared to a -O0 compile.
Only optimizations that apply to straight-line code (basic blocks), such as instruction scheduling, are performed.

7 -O2 Optimization
Default when no optimization arguments are given.
Optimizations that always increase performance.
Can significantly increase compilation time.
-O2 optimization examples:
–Loop nest optimization.
–Global optimization within a function scope.
–2 passes of instruction scheduling.
–Dead code elimination.
–Global register allocation.

8 -O3 Optimization
More extensive optimizations that may in some cases reduce performance.
Optimizes loop nests rather than just inner loops, e.g. inverts loop indices.
"Safe" optimizations: produces answers identical to those produced by -O0.
-O3 is the NERSC recommendation, based on experience with benchmarks.

9 -Ofast Optimization
Equivalent to -O3 -ipa -fno-math-errno -OPT:roundoff=2:Olimit=0:div_split=ON:alias=typed.
-ipa: interprocedural analysis.
–Optimizes across function boundaries.
–Must be specified at both compile and link time.
Aggressive "unsafe" optimizations:
–Changes the order of evaluation.
–Deviates from the IEEE 754 standard to obtain better performance.
There are some known problems with this optimization level in the current release, 2.2.1.

10 NAS B Serial Benchmarks: Performance (MOP/S)

       Seaborg Best   -O0     -O1     -O2     -O3     -Ofast
BT      99.6         157.2   348.1   633.6   739.8   750.9
CG      46.3         101.2   128.3   236.9   223.1   224.5
EP       3.7          15.1    17.5    21.9    21.8
FT     130.1         186.2   231.5   572.4   592.7   did not compile
IS       5.8          16.9    22.0    25.6    27.0    26.8
LU     169.8         129.0   342.4   700.0   809.9   903.2
MG     163.3         109.0   257.9   747.7   518.5   530.0
SP      78.2         104.7   225.7   507.3   462.9   516.6

11 NAS B Serial Benchmarks: Compile Times (seconds)

      -O0    -O1    -O2    -O3    -Ofast
BT    2.1    9.0    4.9    9.1    30.7
CG     .4     .7     .9    1.5
EP     .4     .5     .6     .9
FT     .4     .5     .8    1.5    did not compile
IS     .3     .4     .7     .9
LU    2.1    4.2    5.7   11.4    17.4
MG     .5     .7    1.1    2.2     2.9
SP    1.6    2.0    3.2   10.0    14.4

12 NAS B Optimization Arguments Used by LNXI Benchmarkers

Benchmark   Arguments
BT          -O3 -ipa -WOPT:aggstr=off
CG          -O3 -ipa -CG:use_movlpd=on -CG:movnti=1
EP          -LNO:fission=2 -O3 -LNO:vintr=2
FT          -O3 -LNO:opt=0
IS          -Ofast -DUSE_BUCKETS
LU          -Ofast -LNO:fusion=2:prefetch=0:full_unroll=10:ou_max=5 -OPT:ro=3:fold_unsafe_relops=on:fold_unsigned_relops=on:unroll_size=256:unroll_times_max=16:fast_complex -CG:cflow=off:p2align_freq=1 -fno-exceptions
MG          -O3 -ipa -WOPT:aggstr=off -CG:movnti=0
SP          -Ofast

13 NAS C FT (32 Proc)

Optimization   Mops/Proc   Compile Time (seconds)
Seaborg Best    86.5       N/A
-O0            148.8        .7
-O1            180.6        .9
-O2            356.5       1.4
-O3            347.4       2.4
-Ofast         346.0       3.4

14 SuperLU MPI Benchmark
Based on the SuperLU general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations.
Mostly C with some Fortran 90 routines.
Run on 64 processors/32 nodes.
Uses BLAS routines from ACML.

15 SLU (64 procs)

Optimization   Elapsed Run Time (seconds)   Compile Time (seconds)
Seaborg Best   742.5                        N/A
-O0            276.7                         5.8
-O1            241.5                         7.1
-O2            213.5                        10.6
-O3            212.1                        14.6
-Ofast         N/A                          did not compile

16 Jacquard Applications Acceptance Benchmarks

Benchmark                  Seaborg (sec)   Jacquard (sec)   Jacquard Optimizations
NAMD (32 proc)             2384            554              -O3 -ipa -fno-exceptions
Chombo Serial              1036            138              -O3 -OPT:Ofast -OPT:Olimit=80000 -fno-math-errno -finline
Chombo Parallel (32 proc)   773            161              -O3 -OPT:Ofast -OPT:Olimit=80000 -fno-math-errno -finline
CAM Serial                 1174.4          264              -O2
CAM Parallel (32 proc)       75             13.2            -O2
SuperLU (64 proc)           742.5          212              -O3 -OPT:Ofast -fno-math-errno

17 ACML Library
AMD Core Math Library: a set of numerical routines tuned specifically for AMD64 platform processors.
–BLAS
–LAPACK
–FFT
To use with Pathscale:
–module load acml (built with the Pathscale compilers)
–Compile and link with $ACML
To use with gcc:
–module load acml_gcc (built with the gcc compilers)
–Compile and link with $ACML
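A sketch of the steps above (the source file name is hypothetical; $ACML is the link string the acml module sets, as described on this slide):

```shell
module load acml                     # ACML built with the Pathscale compilers
pathf90 -o solver solver.f90 $ACML   # compile and link against ACML
```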

18 Matrix Multiply Optimization Example
3 ways to multiply 2 dense matrices:
–Directly in Fortran with nested loops.
–The matmul F90 intrinsic.
–dgemm from ACML.
Example: 2 1000 by 1000 double precision matrices.
Order of indices: ijk means
 do i=1,n
  do j=1,n
   do k=1,n

19 Fortran Matrix Multiply MFLOPs

          Seaborg Best   -O0    -O1    -O2    -O3    -Ofast
ijk        693            65    117    148   1640   1706
jik        691            63    100    139   1640   1802
ikj        691            53     51     52   1471   1579
kij        691            48            53   1619   1706
jki        691            72    236    598   1153   1802
kji        691            72    183    385   1599   1706
matmul     946           561    554    564   1683   1706
dgemm     1310          3876   3763   3877

20 Debugging
The Etnus Totalview debugger has been installed on the system.
It is still in testing mode, but it should be available to users soon.

21 Porting Codes
Jacquard is a Linux system, so gnu tools like gmake are the defaults.
The Pathscale compilers are good, but new, so please report any evident compiler bugs to NERSC consulting.

