Performance, accuracy and bit-reproducibility aspects in handling transcendental functions with Cray and Intel compilers for the Met Office Unified Model
Dr. Ilia Bermous, the Australian Bureau of Meteorology
Acknowledgements to Dr. Martyn Corden (Intel), Dr. Zhang Zhang (Intel), Dr. Martin Dix (CSIRO)

Some requirements for meteorological applications
- Performance
  - vectorisation
  - human factor: the time spent to achieve a certain level of performance gain
- Accuracy
  - single & double precision in weather forecasting models
- Results bit-reproducibility
Note: all of the above aspects apply to the handling of transcendental functions in an application and are discussed in this presentation.
Performance: to minimise an application's run time on the available resources. On modern Intel processors, vector instructions may speed up arithmetic operations by up to a factor of 16. It is also very important that computational code be designed in a way that makes vector processing possible.
Accuracy, a must-have condition:
- convergence of the numerical algorithm
- the numerical results produced are acceptable from the scientific point of view
Recent studies include the use of single precision in the ECMWF IFS spectral model ("Single Precision in Weather Forecasting Models: An Evaluation with the IFS", http://journals.ametsoc.org/doi/abs/10.1175/MWR-D-16-0228.1).

Vector processing: Fortran compiler options related to the architecture
Intel compiler: -x<code> for a specific processor architecture
- AVX: Intel Xeon processor E3/E5 and E3/E5/E7 v2 family
- CORE-AVX2: Intel Xeon processor E3/E5/E7 v3, v4 family
- CORE-AVX512: Intel Xeon Processor Scalable Family
- Host: the architecture on which the code is compiled
Cray compiler: -h cpu=target_system
Compiler options set via modules on our XC-40 systems:
  Module               Cray compiler       Intel compiler
  craype-sandybridge   -hcpu=sandybridge   -mavx
  craype-haswell       -hcpu=haswell       -xCORE-AVX2
  craype-broadwell     -hcpu=broadwell
Note: the Intel Xeon Processor Scalable Family has now been launched.
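For illustration only (a minimal sketch, not taken from the UM sources; the subroutine and its arguments are hypothetical), a simple dependence-free loop such as the one below is the kind of code these architecture options target: once the matching craype module is loaded, the compiler is free to generate AVX/AVX2 instructions for it.

  ! Minimal sketch: a contiguous, dependence-free loop that the compiler can
  ! vectorise for the selected target architecture (e.g. Haswell/AVX2 via
  ! -hcpu=haswell with the Cray compiler or -xCORE-AVX2 with the Intel compiler).
  SUBROUTINE axpy(n, a, x, y)
    IMPLICIT NONE
    INTEGER, INTENT(IN)    :: n
    REAL,    INTENT(IN)    :: a, x(n)
    REAL,    INTENT(INOUT) :: y(n)
    INTEGER :: i
    DO i = 1, n
      y(i) = a*x(i) + y(i)
    END DO
  END SUBROUTINE axpy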

Results bit-reproducibility
Means getting identical results on a job rerun.
Benefits: useful in testing, debugging, and tracking down code bugs or numerical instability.
The numerical implementation of the UM includes a setting which provides results bit-reproducibility using a different number of threads for the same horizontal decomposition.

Rerun bit-reproducibility
Intel Fortran compiler: need to specify -fp-model precise, which may affect vectorisation of loops when some transcendental functions are called within them.
Note: below, any reference to a compiler means the Fortran compiler, and -fp-model precise is used in the Intel compiler case.
Cray compiler: results bit-reproducibility may depend on -h flex_mp=level.
"-fp-model precise" disables optimizations that are not value-safe on floating-point data and rounds intermediate results to source-defined precision.
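As an illustration of why value-safety matters for bit-reproducibility (a sketch, not UM code; the array and its values are hypothetical), the program below forms a reduction whose low-order bits can change if the compiler reassociates the additions, e.g. by vectorising the loop with several partial sums; -fp-model precise preserves the source-code evaluation order.

  ! Minimal sketch: floating-point addition is not associative, so the order
  ! of the partial sums can change the last bits of "total". With
  ! -fp-model precise the Intel compiler keeps the source order; without it
  ! the loop may be vectorised using several partial sums.
  PROGRAM reduction_order
    USE iso_fortran_env, ONLY: real64, int64
    IMPLICIT NONE
    INTEGER, PARAMETER :: n = 1000000
    REAL(real64) :: a(n), total
    INTEGER :: i
    DO i = 1, n
      a(i) = 1.0_real64 / REAL(i, real64)   ! hypothetical test data
    END DO
    total = 0.0_real64
    DO i = 1, n                             ! reduction whose evaluation order matters
      total = total + a(i)
    END DO
    ! print the bit pattern so build-to-build differences become visible
    PRINT '(A, Z16.16)', 'bit pattern of total: ', TRANSFER(total, 0_int64)
  END PROGRAM reduction_order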

Accuracy in calculation of transcendental functions
Accuracy is measured in terms of ULP (unit in the last place, or unit of least precision).
Accuracy for the Cray CCE math libraries is controlled by the -hfpN option.
Note: -hfp4 is used to compile the majority of UM files.
  Intel compiler   default                  4 ULP
  Intel compiler   -fp-model precise        0.6 ULP
  Cray compiler    -hfp0 | -hfp1 | -hfp2    1.3 ULP
  Cray compiler    -hfp3                    2.6 ULP
  Cray compiler    -hfp4
The accuracies quoted for the Intel compiler are the maximum deviations from the exact result.
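A ULP error can be measured directly in Fortran. The sketch below is an illustration only (not part of the UM or the talk; the test point is hypothetical, and availability of a quad-precision kind for the reference is assumed): it expresses the difference between a computed value and a higher-precision reference in units of SPACING(), the gap between adjacent representable numbers near the reference.

  ! Minimal sketch: express the error of a computed transcendental value in
  ! ULPs using the standard SPACING intrinsic.
  PROGRAM ulp_error
    USE iso_fortran_env, ONLY: real64, real128
    IMPLICIT NONE
    REAL(real64) :: x, computed, reference, err_ulp
    x         = 0.7_real64                           ! hypothetical test point
    computed  = SIN(x)                               ! library/compiler under test
    reference = REAL(SIN(REAL(x, real128)), real64)  ! quad-precision reference, rounded back
    err_ulp   = ABS(computed - reference) / SPACING(reference)
    PRINT '(A, F6.2, A)', 'error = ', err_ulp, ' ULP'
  END PROGRAM ulp_error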

Vectorisation information for some transcendental functions in loops
With the Cray compiler, and with the Intel compiler using -fp-model precise, loops are:
- vectorised with SQRT, SIN, COS, EXP
- NOT vectorised with ASIN, ACOS, ATAN
Per-function table (Cray compiler vs Intel compiler, Yes/No): Power (x^y), ALOG, ALOG10, TANH, ATAN2.
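As a minimal illustration of the behaviour above (a sketch, not UM code; the subroutine and array names are hypothetical), the first loop below is reported as vectorised by both compilers, while the second is not vectorised by the Intel compiler under -fp-model precise alone; the options on the next slide change that.

  ! Minimal sketch: with -fp-model precise alone, the EXP loop vectorises
  ! but the ASIN loop does not; adding -fast-transcendentals (see the next
  ! slide) allows the ASIN loop to be vectorised as well.
  SUBROUTINE transcendental_loops(n, x, y, z)
    IMPLICIT NONE
    INTEGER, INTENT(IN)  :: n
    REAL,    INTENT(IN)  :: x(n)
    REAL,    INTENT(OUT) :: y(n), z(n)
    INTEGER :: i
    DO i = 1, n
      y(i) = EXP(x(i))     ! vectorised with both compilers
    END DO
    DO i = 1, n
      z(i) = ASIN(x(i))    ! not vectorised under -fp-model precise alone
    END DO
  END SUBROUTINE transcendental_loops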

Can loops for all transcendental functions be vectorised?
Cray compiler: NO (as far as I know)
Intel compiler: YES, in two ways
1. -fast-transcendentals -fimf-precision=<value>, where <value> = high | medium | low
   - high: 1 ULP, used in this presentation
   - medium: 4 ULP, also tested, with very similar performance
   - low: not used
2. Intel Math Kernel Library (MKL), using its vector math library (the error is usually <0.75 ULP in high-accuracy mode)
The <value> parameter lets you specify the level of accuracy (precision) that the compiler should use when determining which math library functions to use. The required accuracy with MKL is controlled by the MKL_VML_MODE environment variable; a value of VML_HA corresponds to high accuracy. The accuracies quoted for the Intel compiler are the maximum deviations from the exact result. MKL calls are optimized for very large trip counts, unlike the compiler.

Environment for implementation
Hardware: Cray XC-40 (Terra), Intel Xeon E5-2690 v3, 2.6 GHz; 136 nodes with 24 cores/node (3264 cores)
Software: Intel compiler v17.0.1.132, Cray CCE 8.4.5, Cray MPICH 7.3.2
MPI_WTIME is used for time measurements.

UM job description
- UM10.7 sources (Feb 2017)
- N768L70 (1532x1152x70 grid size) global model
- 24 hour (192 time steps) forecast with I/O switched off, to tune the computational capability of the model sources
- run configuration: 16x30 horizontal decomposition with 2 threads, on 960 cores
- elapsed time for the job is 920-950 sec
Note: the job was used within the OpenMP coverage improvement collaboration project with the Met Office for the UM10.4-10.8 releases.

Example: departure_point_eta_mod.F90
The subroutine includes 8 three-level nested loops:
  251      DO k = 1, model_levels
  252        DO j = udims%j_start, udims%j_end
  253          DO i = udims%i_start, udims%i_end
  . . .
  276            z_d = SQRT(1.0 - x_d**2 - y_d**2)
  277
  278            depart_xi1(i,j,k) = xi1_u(i) +                           &
  279                    ATAN2(x_d,z_d*csxi2_p(j)-y_d*snxi2_p(j))
  280            depart_xi2(i,j,k) = ASIN(y_d*csxi2_p(j)+z_d*snxi2_p(j))
  281            depart_xi3(i,j,k) = eta_rho_levels(k) - timestep*w_a
  282          END DO
  283        END DO
  284      END DO

Cray compiler loopmark output for the loop
Loopmark output with the -rm compiler option:
  251.  + M m----------<    DO k = 1, model_levels
  252.  + M m 3--------<      DO j = udims%j_start, udims%j_end
  253.    M m 3 Vp-----<        DO i = udims%i_start, udims%i_end
  . . .
  276.    M m 3 Vp                z_d = SQRT(1.0 - x_d**2 - y_d**2)
  277.    M m 3 Vp
  278.    M m 3 Vp                depart_xi1(i,j,k) = xi1_u(i) +           &
  279.    M m 3 Vp                        ATAN2(x_d,z_d*csxi2_p(j)-y_d*snxi2_p(j))
  280.    M m 3 Vp                depart_xi2(i,j,k) = ASIN(y_d*csxi2_p(j)+z_d*snxi2_p(j))
  281.    M m 3 Vp                depart_xi3(i,j,k) = eta_rho_levels(k) - timestep*w_a
  282.    M m 3 Vp----->        END DO
  283.    M m 3-------->      END DO
  284.    M m---------->    END DO
A loop starting at line 253 was partially vectorized.
Comments:
- the inner loop is only partially vectorised
- the loopmark output does not provide details on what is not vectorised
Loopmark legend: M - multithreaded, m - partitioned, V - vectorised, p - partial.

Intel compiler diagnostics for the loop
With -fp-model precise -qopt-report=5 -qopt-report-phase=vec:
  LOOP BEGIN at <path>/departure_point_eta_mod.F90(253,9)
    remark #15382: vectorization support: call to function atan2 cannot be vectorized   [ <path>/departure_point_eta_mod.F90(279,31) ]
    remark #15382: vectorization support: call to function asin cannot be vectorized   [ <path>/departure_point_eta_mod.F90(280,31) ]
    remark #15344: loop was not vectorized: vector dependence prevents vectorization
  LOOP END
With -fp-model precise -fast-transcendentals -fimf-precision=high -qopt-report=5 -qopt-report-phase=vec:
  . . .
    remark #15305: vectorization support: vector length 4
    remark #15309: vectorization support: normalized vectorization overhead 0.303
    remark #15300: LOOP WAS VECTORIZED
    remark #15476: scalar cost: 413
    remark #15477: vector cost: 107.250
    remark #15478: estimated potential speedup: 3.690
    remark #15482: vectorized math library calls: 2

Time measurements for departure_point_eta_mod loops
T = (1/480) * Σ t_i, i = 0, ..., 479 -- the average time over the MPI processes, where t_i is the total time for all 8 loops taken by the MPI process with rank i (the job is run on 480 MPI processes).
Compilation options for departure_point_eta_mod.F90
Cray compiler: -hcpu=haswell -O3 -hvector3 -hscalar3 -hfp4 -hcache3 -haggress -hnocontiguous -hconcurrent -hflex_mp=default
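A sketch of how such a measurement could be taken (an illustration only, not the UM's actual timing code; the subroutine name is hypothetical): each rank accumulates the loop time with MPI_WTIME, and the average over ranks is then formed with a reduction.

  ! Minimal sketch: average a per-rank timing (e.g. t_i for the 8 loops)
  ! over all MPI processes and report it from rank 0.
  SUBROUTINE report_average_time(region_time)
    USE mpi
    IMPLICIT NONE
    REAL(KIND=8), INTENT(IN) :: region_time     ! per-rank total time t_i
    REAL(KIND=8) :: total
    INTEGER :: nproc, rank, ierr
    CALL MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
    CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    CALL MPI_Reduce(region_time, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
    IF (rank == 0) PRINT '(A, F10.3, A)', 'T (average over ranks) = ', total/nproc, ' sec'
  END SUBROUTINE report_average_time

  ! Typical use around the timed region:
  !   t0 = MPI_WTIME()
  !   ... the 8 nested loops ...
  !   region_time = region_time + (MPI_WTIME() - t0)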

Time measurements for departure_point_eta_mod loops (cont. #2)
Intel compiler option sets:
(1) STANDARD: -align array64byte -qopenmp -O3 -fp-model precise -xavx
(2) FAST_TRANS: (1) plus -fast-transcendentals -fimf-precision=high
(3) FAST_TRANS+AVX2: (2) with -xavx replaced by -xCORE-AVX2

Time measurements for departure_point_eta_mod loops (cont. #3)
  Cray                     49.7 sec
  Intel STANDARD           63.1 sec
  Intel FAST_TRANS         29.8 sec
  Intel FAST_TRANS+AVX2    28.4 sec
Conclusions:
- using -fast-transcendentals -fimf-precision=high reduces the elapsed time by more than a factor of 2
- the shortest elapsed time with the Intel compiler is 1.75 times better than with the Cray compiler
- minor speed-up from AVX2

Usage of MKL in the UM
Frequently called transcendental functions in the UM sources are managed by procedures of the following form:
  SUBROUTINE exp_v(n,x,y)
  . . .
  #if defined(MKL)
    CALL vdexp(n, x, y)
  #else
    DO i=1, n
      y(i) = EXP(x(i))
    END DO
  #endif
  END SUBROUTINE exp_v
Similar routines are available in the UM for EXP, SQRT, LOG, SIN, COS, ASIN, ACOS, . . .
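For illustration (a hypothetical caller, not taken from the UM sources; access to exp_v through whichever module provides it is assumed), a routine can pass a whole array to the wrapper instead of calling EXP element by element, so that a build with -DMKL dispatches a single vdexp (VML) call:

  ! Minimal usage sketch: one wrapper call over the whole array replaces an
  ! explicit EXP element loop; with -DMKL this becomes one vdexp call.
  SUBROUTINE apply_exp(n, field_in, field_out)
    IMPLICIT NONE
    INTEGER,      INTENT(IN)  :: n
    REAL(KIND=8), INTENT(IN)  :: field_in(n)
    REAL(KIND=8), INTENT(OUT) :: field_out(n)
    CALL exp_v(n, field_in, field_out)
  END SUBROUTINE apply_exp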

Performance results with the Intel compiler
Build #1: STANDARD optimised build
Build #2: FAST_TRANS build for all model files
Build #3: FAST_TRANS build for all model files, plus pre-processor options set to use the MKL library (-DMKL)
  STANDARD           918 sec
  FAST_TRANS         874 sec
  FAST_TRANS + MKL   877 sec
Conclusions:
- fast-transcendentals provide a reduction in the elapsed time for the job of ~5%, with a relatively minor effort
- MKL has no additional benefit
MKL calls are optimized for very large trip counts, unlike the compiler.

Performance results comparison for the Intel and Cray compilers
Using the same run environment:
- Intel compiler: 874 sec
- Cray compiler: 952 sec, with the Cray compiler options set at the Met Office and inter-procedural optimisation enabled
Conclusion: the elapsed time with the Intel compiler is 9% better than with the Cray compiler.

Conclusions
Intel compiler, using fast-transcendentals and the MKL library:
- vectorisation of some loops is improved
- relatively good performance gains can be achieved: a factor of 2 speed-up for the SL departure point calculation loops and a 5% improvement for the full model
- relatively minor effort required
Cray compiler: at this stage it is not clear whether similar performance benefits are achievable with the compiler.

Thank you