DMI Update (www.dmi.dk)
Leif Laursen (ll@dmi.dk), Jan Boerhout (jboerhout@hpce.nec.com)
CAS2K3, September 7-11, 2003, Annecy, France
Danish Meteorological Institute
DMI is the national weather service for Denmark, Greenland and the Faroe Islands.
– Weather forecasting, oceanography, climate research and environmental studies
– Numerical models are used in all areas
– Increased use of automatic products
– Demands high availability of systems
[System diagram: GTS observations (32 KByte/s) and ECMWF boundary files (10 MByte/s) feed the NEC SX-6, which runs preprocessing, analysis, initialisation, forecast and postprocessing; two 4-processor SGI Origin 200 systems handle data processing, graphics, verification and the operational database; output is archived on a mass storage device.]
[Operational schedule: forecast cycles driven by the 00Z, 06Z, 12Z and 18Z ECMWF boundary files.]
Evolution of the RMS error for MSLP
Quality of 24h forecasts of 10m wind speeds >= 8 m/s
Weibull distributions for 24-hour forecasts from the E, D, ECMWF and UKMO models are shown, together with the curve for the observations.
The new NEC SX-6 computer at DMI

              April 2002         March 2003
  CPUs        16 + 2             60 + 2
  Memory      96 GByte           320 GByte
  Peak        128 Gflops + 16    480 Gflops + 16
  Increase    4                  15
  Disc        1 TByte            4 TByte

Front end: 2 AzusA systems with 4 CPUs and 4 GByte each, plus 2 AsamA systems with 4 CPUs and 8 GByte each.
Total front-end peak: 25.6 Gflops (April 2002); 25.6 + 51.2 Gflops (March 2003).
Some events during the migration to the NEC SX-6
– Oct. 01: Signature of contract between NEC and DMI
– April 02: Upgrade (advection scheme for q, CW and TKE)
– May 02: Installation of phase 1 of the SX-6
– May 02: Parallel system on the SX-6
– June 02: DMI-HIRLAM-I (0.014 degree, 602x600 grid) on the SX-6
– July 02: Stability test passed
– Sep. 02: Operational suite on the SX-6; later removal of the SX-4
– Sep. 02: Testing of new developments (diff. and convection)
– Dec. 02: Upgrade: 40 levels, reduced time step, AMSU-A data
– Jan. 03: Revised contract between NEC and DMI
– Mar. 03: Installation of phase 2 of the SX-6
– July 03: Stability test passed
– Sep. 03: Improvements in data assimilation (FGAT, QuikSCAT etc.)
– Early 04: New operational HIRLAM set-up using 6 nodes
HIRLAM Scalability
– Optimization
– Methods
– Implementation
– Performance
Optimization Focus
– Data transposition: from the 2D to the FFT distribution and back, and from the FFT to the TRI distribution and back
– Exchange of halo points: between north and south, and between east and west
– GRIB file I/O
– Statistics
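The halo exchange named above can be sketched in serial with NumPy as a stand-in for the actual Fortran/MPI code; the subdomain sizes and the halo width of one point are assumptions for illustration only:

```python
import numpy as np

# Two neighbouring subdomains along the north-south direction,
# each 3 latitudes x 4 longitudes (sizes assumed), halo width 1.
halo = 1
a = np.arange(1, 13, dtype=float).reshape(3, 4)   # southern subdomain
b = np.arange(13, 25, dtype=float).reshape(3, 4)  # northern subdomain

# Pad each subdomain with a halo row for the neighbour's boundary data.
a_h = np.zeros((3 + halo, 4)); a_h[:3] = a
b_h = np.zeros((3 + halo, 4)); b_h[halo:] = b

# The "exchange": copy the neighbour's boundary row into the halo
# (in the parallel code this is one message per neighbour).
a_h[3] = b[0]    # south receives north's first row
b_h[0] = a[-1]   # north receives south's last row
```

Packing a whole boundary row into one message, rather than sending point by point, is exactly the coarse-granularity principle the later slides argue for.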
Approach
First attempt: a straightforward conversion from SHMEM to MPI-2 put/get calls. It works, but has too much overhead due to the fine granularity of the messages.
Redesign of the transposition and halo-swap routines:
– fewer and larger messages
– independent message-passing process groups
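Why fine granularity hurts can be seen with a toy latency/bandwidth cost model; the latency, bandwidth and message counts below are illustrative assumptions, not SX-6 measurements:

```python
# Toy cost model: time = n_messages * latency + total_bytes / bandwidth.
# Latency 5 us and bandwidth 8 GB/s are assumed round numbers.
def transfer_time(n_messages, total_bytes, latency=5e-6, bandwidth=8e9):
    return n_messages * latency + total_bytes / bandwidth

total = 64 * 2**20   # 64 MiB redistributed in either scheme

fine   = transfer_time(100_000, total)   # many small put/get calls
coarse = transfer_time(100, total)       # few large messages

print(f"fine-grained: {fine:.4f} s, coarse-grained: {coarse:.4f} s")
```

The bandwidth term is identical in both cases; only the per-message latency term changes, so fewer and larger messages win whenever latency is non-negligible.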
2D Sub Grids
[Figure: HIRLAM sub-grid definition in the TWOD data distribution; processors 0-11 arranged in a 3-by-4 layout over the latitude-longitude plane, each holding all levels of its sub grid.]
Original FFT Sub Grids
[Figure: HIRLAM sub-grid definition in the FFT data distribution; each processor handles slabs of full longitude lines.]
2D↔FFT Redistribution
Sub-grid data must be distributed to all processors: (nprocx × nprocy)² send-receive pairs.
2D↔FFT Redistribution
Sub grids in the east-west direction form full longitude lines: nprocy independent sets of nprocx² send-receive pairs, i.e. nprocy × nprocx² pairs in total, which is nprocy times fewer messages.
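The message-count arithmetic, using the 3-by-4 processor layout shown on the slides:

```python
# Send-receive pair counts for the 2D<->FFT redistribution on an
# nprocx-by-nprocy processor grid (3 x 4 = 12 processors as on the slides).
nprocx, nprocy = 3, 4
nproc = nprocx * nprocy

naive = nproc ** 2               # every processor exchanges with every other
grouped = nprocy * nprocx ** 2   # nprocy independent groups of nprocx each

print(naive, grouped)            # 144 vs 36: nprocy times fewer messages
```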
Transpositions 2D↔FFT↔TRI
[Figure: the same 12 processors in the three distributions over the latitude-longitude-levels grid: a 3-by-4 block layout (2D), slabs of full longitude lines (FFT), and columns with the full latitude extent (TRI).]
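The point of each transposition is which grid dimension stays complete on a processor. A minimal NumPy sketch, with toy grid sizes and a two-processor split that are assumptions for illustration:

```python
import numpy as np

# Toy grid: 6 longitudes x 4 latitudes x 2 levels (sizes assumed).
grid = np.arange(6 * 4 * 2).reshape(6, 4, 2)

# FFT distribution: split over latitude, so each part keeps full
# longitude lines for the Fourier transforms.
fft_parts = np.array_split(grid, 2, axis=1)

# TRI distribution: split over longitude, so each part keeps the full
# latitude extent for the tridiagonal solves.
tri_parts = np.array_split(grid, 2, axis=0)

print(fft_parts[0].shape, tri_parts[0].shape)  # (6, 2, 2) (3, 4, 2)
```

A transposition is then the all-to-(row/column)-group exchange that moves the data from one of these splits to the other.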
MPI Methods
Transfer methods:
– Remote Memory Access: mpi_put, mpi_get
– Asynchronous point-to-point: mpi_isend, mpi_irecv
– All-to-all: mpi_alltoallv, mpi_alltoallw
Buffering vs. direct:
– Explicit buffering
– MPI derived types
(Method selection by environment variables)
Test Grid Details

  Parameter        Value   Notes
  Longitudes       602
  Latitudes        568
  Levels           60
  NSTOP            40      steps
  Initialization   none
  Time step        180     seconds

Performance
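For scale, the grid in the table works out as follows; the 8-byte word size used for the memory estimate is an assumption:

```python
# Size of the test grid from the table above.
nlon, nlat, nlev = 602, 568, 60
points = nlon * nlat * nlev
print(points)                 # 20516160 grid points per 3D field

# Rough memory footprint of one such field, assuming 8-byte reals:
mib = points * 8 / 2**20
print(f"{mib:.0f} MiB per 3D field")
```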
Parallel Speedup on the NEC SX-6
Cluster of 8 NEC SX-6 nodes at DMI. Up to 60 processors: 7 nodes with 8 processors per node, plus 1 node with 4 processors. Parallel efficiency: 78% on 60 processors.
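Parallel efficiency is speedup divided by processor count, so the reported figure pins down the speedup; a quick check (the helper function is illustrative):

```python
# Parallel efficiency = speedup / p = T1 / (p * Tp).
def efficiency(t_serial, t_parallel, p):
    return t_serial / (p * t_parallel)

# 78% efficiency on 60 processors implies a speedup of about:
speedup = 0.78 * 60
print(f"{speedup:.1f}")   # 46.8
```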
Performance Observations
– The new data-redistribution method is much more efficient (78% vs. 45% parallel efficiency on 60 processors).
– No performance advantage of RMA (one-sided message passing) or all-to-all over the plain point-to-point method.
– MPI derived types give elegant code, but explicit buffering is faster.
Questions? Thank you!