4D-VAR Optimization Efficiency Tuning

4D-VAR Optimization Efficiency Tuning
Tomas Wilhelmsson
Swedish Meteorological and Hydrological Institute (now ECMWF)
2009-05-14

Previous HIRLAM 4D-VAR performance scaling
Torgny Faxén (NSC):
Niko Sokka (FMI): “In the end of the day, 4DVAR scales up to 84 processors in our system and then stalls.”

Efforts to improve HIRLAM 4D-VAR scaling
- Switch from 1D to 2D partitioning?

Transposes with 1D decomposition

Transposes with 2D partitioning

Efforts to improve HIRLAM 4D-VAR scaling
- Switch from 1D to 2D partitioning?
  - Evaluated in a toy model in 2005; no clear benefit

Efforts to improve HIRLAM 4D-VAR scaling
- Switch from 1D to 2D partitioning?
  - Evaluated in a toy model in 2005; no clear benefit
- Reduce interprocessor communication

HIRLAM 4D-VAR 1D partitioning
- SMHI C22 area inner loop at 1/3 resolution, with a 30-minute time step
- 103 x 103 grid distributed over 26 processors; each gets a 103 x 4 slice
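A small sketch (purely illustrative, not HIRLAM code) of the 1D row decomposition this slide describes: 103 rows spread over 26 tasks, so most tasks end up with a 103 x 4 slice.

```fortran
! Minimal sketch of a 1D row decomposition: 103 rows over 26 tasks.
! The first 'rem' tasks get one extra row; most tasks get 4 rows.
program decomp_sketch
  implicit none
  integer, parameter :: nrows = 103, ntasks = 26
  integer :: rank, base, rem, first, last
  base = nrows / ntasks          ! minimum rows per task
  rem  = mod(nrows, ntasks)      ! tasks 0..rem-1 get one extra row
  do rank = 0, ntasks - 1
     first = rank * base + min(rank, rem) + 1
     last  = first + base - 1
     if (rank < rem) last = last + 1
     print '(a,i3,a,i4,a,i4)', 'task ', rank, ': rows ', first, ' to ', last
  end do
end program decomp_sketch
```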

Accumulated stencils in halo zone

Efforts to improve HIRLAM 4D-VAR scaling
- Switch from 1D to 2D partitioning?
  - Evaluated in a toy model in 2005; no clear benefit
- Reduce interprocessor communication
  - Slswap on demand, implemented 2008
  - Investigate the “HALO-lite” (Mozdzynski, 2008) approach?

Efforts to improve HIRLAM 4D-VAR scaling
- Switch from 1D to 2D partitioning?
  - Evaluated in a toy model in 2005; no clear benefit
- Reduce interprocessor communication
  - Slswap on demand, implemented 2008
  - Investigate the “HALO-lite” (Mozdzynski, 2008) approach?
- Reduce work
  - Fewer FFTs, which also reduces communication (Gustafsson, 2008)

Efforts to improve HIRLAM 4D-VAR scaling
- Switch from 1D to 2D partitioning?
  - Evaluated in a toy model in 2005; no clear benefit
- Reduce interprocessor communication
  - Slswap on demand, implemented 2008
  - Investigate the “HALO-lite” (Mozdzynski, 2008) approach?
- Reduce work
  - Fewer FFTs, which also reduces communication (Gustafsson, 2008)
- Additional sources of parallelism
  - OpenMP

1D decomposition => a strict upper bound on the number of MPI tasks
- The SMHI C22 area has 306x306 grid points; the 4D-VAR inner loop runs at one-third resolution with 103x103 grid points.
- At least 2 rows per task (for efficiency) gives at most 52 MPI tasks!
- That is only 12% (7 out of 56 nodes) of SMHI's operational cluster.
- The C11 area could use 24% of the machine, but has 4 times the work.
- It gets even worse on future computers (Moore's law).
- OpenMP can enable better use of multi-core processors.
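A rough check of where the 12% comes from (assuming the 8-cores-per-node figure quoted later for gimle.nsc.liu.se, and rounding task and node counts up):

$$\left\lceil \tfrac{103\ \text{rows}}{2\ \text{rows/task}} \right\rceil = 52\ \text{tasks}, \qquad \left\lceil \tfrac{52\ \text{tasks}}{8\ \text{cores/node}} \right\rceil = 7\ \text{nodes}, \qquad \tfrac{7}{56} = 12.5\% \approx 12\%.$$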

TL time step on 26 processors

OpenMP implementation
- Physics was already “done” (through phcall and phtask)
  - But check len_loops in the namelist (thread granularity)
- Semi-Lagrangian dynamics parallelized over the outer “vertical” loop
  - Vertical loop over 40 to 60 levels
- 572 !$omp directives added in spdy/*.F
- Large parallel sections, using orphaned directives
- Some nowait directives
- MPI calls within !$omp master / !$omp end master
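A minimal sketch (not the actual HIRLAM code) of the pattern listed above: one large parallel region in the caller, an orphaned !$omp do inside the worker routine, and communication confined to the master thread. All routine and variable names are illustrative; the halo-swap routine is a placeholder for the real MPI exchange.

```fortran
module sketch_mod
  implicit none
contains
  subroutine vertical_work(nhor, nlev, field)
    integer, intent(in)    :: nhor, nlev
    real,    intent(inout) :: field(nhor, nlev)
    integer :: k
    !$omp do                        ! orphaned directive: binds to the
    do k = 1, nlev                  ! parallel region in the caller
       field(:, k) = field(:, k) + 1.0
    end do
    !$omp end do
  end subroutine vertical_work

  subroutine halo_swap(field)
    real, intent(inout) :: field(:, :)
    ! Placeholder: in the real code this is where the MPI halo
    ! exchange (slswap) would be performed.
  end subroutine halo_swap
end module sketch_mod

program tl_step_sketch
  use sketch_mod
  implicit none
  integer, parameter :: nhor = 103*4, nlev = 60
  real :: field(nhor, nlev)
  field = 0.0
  !$omp parallel
  call vertical_work(nhor, nlev, field)   ! threads share the vertical loop
  !$omp master
  call halo_swap(field)                   ! MPI calls on the master thread only
  !$omp end master
  !$omp barrier                           ! other threads wait for the halo data
  call vertical_work(nhor, nlev, field)
  !$omp end parallel
  print *, 'field(1,1) =', field(1, 1)
end program tl_step_sketch
```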

VERINT_AD: adjoint of the interpolation stencil
- Departure points are interpolated with a 4x4x4 stencil, which can be computed one level at a time
- In the adjoint, contributions are summed over the stencil, writing to 4 levels => write races with the parallelized vertical loop
- First implementation: omp_set_lock / omp_unset_lock to limit concurrent writes to each vertical level
  - Slower than the serial version!
- Second implementation: reuse the #ifdef VECTOR code
  - Put contributions in a temporary vector
  - Sum within !$omp critical / !$omp end critical
  - Scales beautifully
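A minimal sketch of the second approach, under simplified assumptions (the real VERINT_AD scatters a 4x4x4 stencil into neighbouring levels; here the shapes and names are illustrative only): each thread accumulates its adjoint contributions in a private temporary array and adds them to the shared array inside a critical section.

```fortran
program verint_ad_sketch
  implicit none
  integer, parameter :: nlev = 60, npts = 1000
  real :: adj(nlev)          ! shared adjoint accumulator (one value per level)
  real :: tmp(nlev)          ! per-thread temporary, avoids write races
  integer :: k, i
  adj = 0.0
  !$omp parallel private(tmp, i)
  tmp = 0.0
  !$omp do
  do k = 1, nlev
     do i = 1, npts
        ! the real stencil scatters into several neighbouring levels;
        ! here each contribution just goes into the thread-private tmp
        tmp(k) = tmp(k) + 1.0
     end do
  end do
  !$omp end do
  !$omp critical
  adj = adj + tmp            ! one thread at a time updates the shared array
  !$omp end critical
  !$omp end parallel
  print *, 'total contributions:', sum(adj)
end program verint_ad_sketch
```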

HOWTO combine Intel compiler OpenMP with Scali MPI
- Set the number of OpenMP threads:
  setenv OMP_NUM_THREADS 4
- Intel compiler runtime environment:
  setenv KMP_LIBRARY turnaround
  setenv KMP_STACKSIZE 128m
  setenv KMP_AFFINITY "none,granularity=core"
  setenv KMP_VERBOSE TRUE
- ScaMPI version 3.13.8-5915 bug workaround:
  setenv SCAFUN_CACHING_MODE 0
- Launch (with socket affinity for up to 4 threads, or with no affinity):
  mpirun -affinity_mode automatic:bandwidth:socket
  mpirun -affinity_mode none

HIRLAM 7.2 4D-VAR OpenMP performance on gimle.nsc.liu.se (8 cores per node)
- SMHI C22 area, 306x306 grid, 40 “minimize” iterations
- Inner TL/AD loop at 1/3 resolution, 103x103 grid
- Times for “minimize” (no I/O):
  - 13 nodes, best configuration for pure MPI or MPI/OpenMP hybrid:
    - 126 seconds with 52 tasks
    - 91 seconds with 26 tasks and 4 threads/task
  - 26 nodes:
    - 92 seconds with 52 tasks
    - 65 seconds with 52 tasks and 4 threads/task
- Best results with OpenMP within sockets and MPI between them

Is 4D-VAR operationally feasible for SMHI's C11 area?
- C11 area, 606x606 grid, 120 “minimize” iterations, total runtime
- Inner TL/AD loop at 1/3 resolution, 203x203 grid

| Time (seconds) | Model version | Configuration | Comment |
|---|---|---|---|
| ca 3600 | 7.1.2 | 24 nodes, 192 tasks, len_loops=2047 | Original setup based on current operational C22 |
| 1752 | 7.1.2 | 26 nodes, 51 tasks, len_loops=16 | Improved configuration |
| 1486 | 7.2 | as above | Fewer FFTs, slswap on demand |
| 1387 | 7.2 | as above, and 102 tasks | 2*tasks |
| 1286 | 7.3 | as above, and 2 threads => 204 cores | 2 OpenMP threads |
| 1086 | 7.3 | 26 nodes, 51 tasks, 4 threads => 204 cores | Best setup! |

Conclusion
- OpenMP in HIRLAM 4D-VAR is necessary for today's clusters
- Slswap on demand works, and is expected to become more important at higher processor counts
- Nothing beats avoiding work completely
- No silver bullet: performance comes from small incremental improvements
- Technology will allow larger areas “for free”, but increasing the number of time steps while maintaining runtime will be harder

ITC - Intel Thread Checker
- Finds OpenMP parallelization bugs, like possible race conditions
- Very easy to use
- Compile: ifort -openmp -tcheck
  - No optimization (any -O is ignored)
  - No thread-specific knowledge allowed, like omp_get_thread_num()
- Instrument and run: tcheck_cl a.out
  - Takes an order of magnitude more memory and runtime

Intel Thread Checker - Output
- Forgot to declare w0 as private in horint_ad_local.F
- Wrongly placed nowait in calintf.F
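An illustrative example (not the actual horint_ad_local.F code) of the first kind of bug reported above: a work variable that is shared by default, so all threads race on it; adding private(w0) removes the race that Thread Checker would flag.

```fortran
program race_sketch
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), w0
  integer :: i
  b = 1.0
  !$omp parallel do private(w0)   ! without private(w0), Thread Checker
  do i = 1, n                     ! reports a data race on w0
     w0 = 0.5 * b(i)              ! per-iteration scratch value
     a(i) = w0 + b(i)
  end do
  !$omp end parallel do
  print *, 'a(1) =', a(1)
end program race_sketch
```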