
Performance Analysis, Profiling and Optimization of the Weather Research and Forecasting (WRF) Model
Negin Sobhani 1,2, Davide Del Vento 2, David Gill 2, Sam Elliot 3,2, and Srinath Vadlamani 4
1 University of Iowa; 2 National Center for Atmospheric Research (NCAR); 3 University of Colorado at Boulder; 4 ParaTools Inc.

Outline
Introduction
WRF MPI scalability
Hybrid parallelization
Profiling WRF: Intel VTune Amplifier XE and TAU tools
Identifying hotspots and suggested areas for improvement

The Weather Research & Forecasting (WRF) Model
Numerical weather prediction system
Designed for both operational forecasting and atmospheric research
Community model with a large user base: more than 30,000 users in 150 countries
Figure from the WRF-ARW Technical Note

Previous Scaling Studies
WRF has been benchmarked on many different systems.
Figures from cisl.ucar.edu

TACC Stampede Supercomputer
Aggregate peak performance: ~10 PFLOPS (PF)
6400+ Dell PowerEdge (C8220z) server nodes
Two Intel Xeon E5 (Sandy Bridge) processors and one Intel Xeon Phi coprocessor (MIC architecture) per compute node
Each compute node has 32 GB of "host" memory plus an additional 8 GB of memory on the Xeon Phi coprocessor card
2.2 PF from the Xeon E5 processors and 7.4 PF from the Xeon Phi coprocessors
Figures from tacc.utexas.edu

Hurricane Sandy Benchmark
Coarse resolution: 40 km (50 x 50 grid), time step 180 s
Fine resolution: 4 km (500 x 500 grid), time step 20 s
Time period for both simulations: 54-hour forecast, from 2012 Oct 27 12:00 UTC through 2012 Oct 29 18:00 UTC
60 vertical layers
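For readers unfamiliar with how such a configuration is expressed, the 4-km case corresponds roughly to namelist.input settings along these lines (a sketch: only the grid size, time step, vertical levels, and forecast window come from the slide; everything else WRF needs is omitted here):

```fortran
&time_control
 run_hours   = 54,                                      ! 54-hour forecast
 start_year  = 2012, start_month = 10, start_day = 27, start_hour = 12,
 end_year    = 2012, end_month   = 10, end_day   = 29, end_hour   = 18,
/

&domains
 time_step   = 20,      ! seconds (180 s for the 40-km case)
 e_we        = 501,     ! 500 x 500 horizontal grid cells (staggered dims are +1)
 e_sn        = 501,
 e_vert      = 60,      ! 60 vertical layers
 dx          = 4000,    ! 4-km grid spacing
 dy          = 4000,
/
```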

Scalability Assessment (MPI Only)
500 x 500 horizontal grid (the 4-km domain)
[Scaling plot, with regions labeled compute bound and MPI bound]
Simulation speed is the duration of the simulated period per unit of wall-clock time.
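Written out (this is just the slide's definition restated, with the time step made explicit):

```latex
\text{simulation speed}
  = \frac{\text{simulated time}}{\text{wall-clock time}}
  = \Delta t \times \frac{\text{time steps completed}}{\text{wall-clock second}}
```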

Scalability Assessment (MPI Only): Allinea Performance Reports
Without split output files, 79% of total time is spent in MPI (Allinea Performance Reports counts MPI I/O as MPI time).
With a separate NetCDF output file per process (io_form_history = 102 in the WRF namelist), 87% of total time is spent on computation.
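For reference, the split-output setting named above lives in the &time_control namelist group; a minimal sketch:

```fortran
&time_control
 io_form_history = 102,   ! one NetCDF history file per MPI task instead of a single shared file
/
```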

Domain Decomposition (MPI only)
Per grid: halving the per-rank subdomain in each horizontal direction cuts the computation to 1/4 but the MPI (halo-exchange) work only to about 1/2 (see the sketch below).
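A back-of-the-envelope version of that area-versus-perimeter argument, assuming a rectangular n_x by n_y subdomain per rank:

```latex
\text{computation} \;\propto\; n_x n_y
  \;\longrightarrow\; \frac{n_x}{2}\cdot\frac{n_y}{2} \;=\; \frac{1}{4}\,n_x n_y,
\qquad
\text{halo exchange} \;\propto\; 2\,(n_x + n_y)
  \;\longrightarrow\; 2\!\left(\frac{n_x}{2} + \frac{n_y}{2}\right) \;=\; \frac{1}{2}\cdot 2\,(n_x + n_y)
```

So the MPI share of each rank's work grows as more ranks are added, which is why the scaling curve eventually becomes MPI bound.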

AVX compiler flag
AVX (Intel® Advanced Vector Extensions) is a 256-bit instruction set extension that permits more aggressive optimization.
The AVX build does not work with the Intel 15 compiler; this issue has been reported to Intel.
Intel 15 itself performs a little better than shown here!

Hybrid Parallelization
Hybrid: distributed plus shared memory parallelism (dmpar + smpar, i.e. MPI + OpenMP)
As the number of threads increases, the performance decreases, even though the cores are never oversubscribed.
Binding increases the performance significantly (e.g., I_MPI_PROCESSOR_LIST=p1,p2 or the tacc_affinity script).

Intel VTune Amplifier XE
Intel profiling and performance analysis tool
Profiling includes stack sampling, thread profiling, and hardware event sampling
Collects performance statistics for different parts of the code

What makes WRF expensive?
Longwave radiation scheme: RRTMG (ra_lw_physics = 4)
Shortwave radiation scheme: CAM (ra_sw_physics = 3)
Microphysics scheme: Thompson et al. 2008 (mp_physics = 8)
[Chart: share of total time (%) by model component]
But is this case representative, given the significant effect of the dynamics on performance?
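For reference, these three choices map onto the &physics namelist group as follows (only the three options named above come from the benchmark; any other physics settings are left out of this sketch):

```fortran
&physics
 ra_lw_physics = 4,   ! RRTMG longwave radiation scheme
 ra_sw_physics = 3,   ! CAM shortwave radiation scheme
 mp_physics    = 8,   ! Thompson et al. (2008) microphysics
/
```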

Microphysics options summary

Scheme                  | mp_physics | Simulation speed | # of variables | Time steps/s
Kessler                 | 1          | 2493.6           | 3              | 13.8
Purdue Lin et al.       | 2          | 2043.8           | 6              | 11.3
WSM-3                   | 3          | 2263.8           | -              | 12.5
WSM-5                   | 4          | 2012.3           | 5              | 11.2
Ferrier (current NAM)   | 5          | 2451.2           | -              | 13.6
WSM-6                   | 6          | 1859.5           | -              | 10.3
Goddard 6-class         | 7          | 1929.9           | -              | 10.6
Thompson et al.         | 8          | 1739.8           | -              | 9.7
Milbrandt-Yau 2-moment  | 9          | 1189.3           | 13             | 6.6
Morrison 2-moment       | 10         | 1475.5           | -              | 8.2
WDM-5                   | 14         | 1478.6           | 8              | -
WDM-6                   | 16         | 1358.8           | 9              | 7.5

Thompson microphysics is among the most expensive microphysics options, and it is widely used.
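As a sanity check, the two timing columns appear to be consistent with the 40-km case's 180 s time step; for Kessler, for example:

```latex
180\ \mathrm{s} \times 13.8\ \frac{\mathrm{steps}}{\mathrm{wall\text{-}clock\ s}} \approx 2484
```

which matches the listed simulation speed of 2493.6 to within the rounding of the steps-per-second column.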

TAU tools
TAU (Tuning and Analysis Utilities) is a program and performance analysis tool framework for high-performance parallel and distributed computing.
TAU can automatically instrument source code, using a package called PDT, for routines, loops, I/O, memory, phases, etc.
TAU uses wall-clock time and PAPI metrics to read hardware counters for profiling and tracing.

Using TAU/PAPI for the Advection Module
1 - PDT instrumentation of module_advect_em
2 - Manually instrumented code for higher granularity on the loops of interest (see the sketch below)
TAU/PAPI metrics analyzed:
Time
L1 and L2 data cache misses (DCM)
Conditional branch instructions mispredicted
Floating-point instructions and operations
Single- and double-precision vector/SIMD instructions
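A minimal sketch of what the manual instrumentation looks like with TAU's Fortran API (the subroutine, variable names, and loop body are illustrative stand-ins, not the actual WRF source; the code must be built with TAU's compiler wrapper for the TAU_PROFILE_* calls to resolve):

```fortran
subroutine advect_demo(q, n)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: q(n)
  integer                :: i
  integer, save          :: profiler(2) = (/ 0, 0 /)   ! TAU timer handle

  ! Create (once) and start a named timer around the region of interest.
  call TAU_PROFILE_TIMER(profiler, 'positive-definite advection loop')
  call TAU_PROFILE_START(profiler)

  do i = 2, n - 1                      ! placeholder loop body
     q(i) = max(0.0, 0.5 * (q(i-1) + q(i+1)))
  end do

  call TAU_PROFILE_STOP(profiler)
end subroutine advect_demo
```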

Identified Hotspots
1 - Positive-definite advection loop (32 lines)
High time
High cache misses (both L1 and L2)
High branch misprediction
2 - x, y, z "flux 5" advection equation loops
High cache misses
Repeated throughout the code for the different advection schemes

Moisture transport in ARW
Until recently, many weather models did not conserve moisture because of the numerical challenges in advection schemes, leading to a high bias in precipitation.
The WRF-ARW solver is conservative, but not all of its advection schemes are; non-conservative advection introduces new mass into the system.
Advection schemes can introduce both positive and negative errors, particularly at sharp gradients.
Figure from Skamarock and Dudhia 2012

Advection options in WRF
moist_adv_opt = 0 (simple), moist_adv_opt = 1 (positive definite), moist_adv_opt = 2 (monotonic)
Options 1 and 2 add explicit IFs to remove negative values, overshoots, and oscillations.
The high number of explicit IFs causes high branch mispredictions (see the sketch below).
Figure from Skamarock and Dudhia 2012
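For illustration only (this is not the WRF source, just a stand-in for the pattern being described), the misprediction problem and one way around it look like this:

```fortran
subroutine clip_negative(q, tend, dt, its, ite)
  implicit none
  integer, intent(in)    :: its, ite
  real,    intent(in)    :: dt, tend(its:ite)
  real,    intent(inout) :: q(its:ite)
  integer :: i

  ! Branchy pattern: a data-dependent IF at every grid point. Near sharp
  ! gradients the branch direction is hard to predict, so it causes frequent
  ! branch mispredictions and hinders vectorization.
  do i = its, ite
     if (q(i) + dt * tend(i) < 0.0) then
        q(i) = 0.0
     else
        q(i) = q(i) + dt * tend(i)
     end if
  end do
end subroutine clip_negative

subroutine clip_negative_branchless(q, tend, dt, its, ite)
  implicit none
  integer, intent(in)    :: its, ite
  real,    intent(in)    :: dt, tend(its:ite)
  real,    intent(inout) :: q(its:ite)
  integer :: i

  ! Equivalent branchless form: MAX typically compiles to min/max or masked
  ! vector instructions, removing the misprediction penalty.
  do i = its, ite
     q(i) = max(0.0, q(i) + dt * tend(i))
  end do
end subroutine clip_negative_branchless
```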

The effect of optimizing the advection module
1 - Optimizing the positive-definite advection module
Test 1: WRF-only case; Test 2: WRF-Chem case

Case     | Advected variables                                                       | Maximum performance increase
WRF      | Moisture                                                                 | 13%
WRF-Chem | Moisture, tracers, species, scalars, chemical concentrations, particles | 21% *

* The performance increase will be even higher for dust- and particle-only WRF-Chem cases.
This hotspot has the potential to be optimized and to provide a significant improvement in performance.

Identified Hotspots
1 - Positive-definite advection loop
High time
High cache misses (both L1 and L2)
High branch misprediction
2 - x, y, z "flux 5" advection equation loops
High cache misses
Repeated throughout the code for the different advection schemes

The effect of optimizing the advection equations
2 - "Flux 5" advection equations
High time and high L1 and L2 data cache misses
This loop is repeated throughout the code for the x, y, and z directions
A very similar loop is repeated for all of the advection schemes
Test 1: WRF 4-km benchmark with TAU instrumentation
58% of the time spent in advection is in these flux-equation loops
Many L1 data cache misses per iteration
Many L2 data cache misses per iteration
This hotspot has the potential to be optimized and to provide a significant improvement in performance (see the sketch below).
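For context, these loops evaluate, at every cell face and in each of the three directions, something close to the standard 5th-order upwind flux of Wicker and Skamarock (2002). The sketch below is a generic rendering of that operator, not the WRF source; the real loops also carry boundary handling and the other advection orders:

```fortran
! 5th-order upwind flux across the face between cells i-1 and i.
! q_* are the advected scalar at the six stencil points and vel is the
! face-normal velocity (names here are illustrative).
pure function flux5(q_im3, q_im2, q_im1, q_i, q_ip1, q_ip2, vel) result(f)
  implicit none
  real, intent(in) :: q_im3, q_im2, q_im1, q_i, q_ip1, q_ip2, vel
  real :: f

  f = vel * ( 37.0*(q_i + q_im1) - 8.0*(q_ip1 + q_im2) + (q_ip2 + q_im3) ) / 60.0 &
      - abs(vel) * ( (q_ip2 - q_im3) - 5.0*(q_ip1 - q_im2) + 10.0*(q_i - q_im1) ) / 60.0
end function flux5
```

The six-point stencil per face, applied to every 3-D field and repeated for x, y, and z, is a plausible source of the L1/L2 data cache misses reported above.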

Conclusion
WRF shows good MPI scalability, depending on the workload.
Thread binding should be used to improve the performance of WRF hybrid runs.
Intel VTune Amplifier and TAU tools were used for performance analysis of the WRF code.
Dynamics is identified as the most expensive part of ARW.
We identified the hotspots of the advection module and estimated the performance increase obtainable by modifying these parts of the WRF code.

Ongoing and Future Work
Performance improvement of the advection module
Analysis of hardware counters to fix branch mispredictions and cache misses
Advection module vectorization for Intel Xeon Phi coprocessors
Reducing the memory footprint by decreasing the number of temporary variables
Exploring performance optimization with different compiler flags
Loop transformations to enable better vectorization

Acknowledgements
Davide Del Vento, Rich Loft, Srinath Vadlamani, Dave Gill, Greg Carmichael, and all SIParCS admins and staff

Microphysics Schemes
Provide atmospheric heat and moisture tendencies
Include water vapor, cloud, and precipitation processes
[Figure: microphysical rates and surface rainfall, from Mielikainen et al. 2014]

WRF Model Integration Procedure
Begin time step
  Runge-Kutta loop (steps 1, 2, and 3)
    (i) advection, p-grad, buoyancy using …
    (ii) physics: if step 1, save for steps 2 and 3
    (iii) mixing, other non-RK dynamics, save …
    (iv) assemble dynamics tendencies
    Acoustic step loop
      (i) advance U, V, then W
      (ii) time-average U, V, W
    End acoustic loop
    Advance scalars using time-averaged U, V, W
  End Runge-Kutta loop
  Other physics (currently microphysics)
End time step