Recent Developments for Parallel CMAQ
Jeff Young, AMDB/ASMD/ARL/NOAA
David Wong, SAIC – NESCC/EPA
AQF-CMAQ
  Running in "quasi-operational" mode at NCEP
  26+ minutes for a 48-hour forecast (on 33 processors)
Some background:
  MPI data communication – ghost (halo) regions
  Code modifications to improve data locality
  Some vdiff optimization
  Jerry Gipson's MEBI/EBI chem solver
  Tried CGRID( SPC, LAYER, COLUMN, ROW ) – tests indicated not good for MPI communication of data (see the layout sketch below)
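A minimal illustration of the two index orderings (the variable names and the species count are hypothetical, not CMAQ source): Fortran is column-major, so the leftmost index is contiguous in memory, which is what made CGRID( SPC, LAYER, COLUMN, ROW ) attractive for cell-by-cell chemistry; the tests above nevertheless found it a poor fit for the MPI exchanges.

   program cgrid_layouts
      implicit none
      ! Domain dimensions from the AQF test case; the species count is a
      ! hypothetical placeholder.
      integer, parameter :: ncols = 166, nrows = 142, nlays = 22, nspcs = 50
      real, allocatable  :: cgrid_std( :,:,:,: )    ! ( col, row, lay, spc )
      real, allocatable  :: cgrid_alt( :,:,:,: )    ! ( spc, lay, col, row )

      allocate( cgrid_std( ncols, nrows, nlays, nspcs ) )
      allocate( cgrid_alt( nspcs, nlays, ncols, nrows ) )

      ! Chemistry works cell by cell: cgrid_alt( :, l, c, r ) is one
      ! contiguous run of nspcs values, while cgrid_std( c, r, l, : ) is
      ! strided by ncols*nrows*nlays -- the data-locality motivation for
      ! trying the SPC-first ordering.
      ! For the horizontal ghost-region exchanges, however, the tests noted
      ! above indicated the SPC-first ordering was not good for MPI
      ! communication of data.
      deallocate( cgrid_std, cgrid_alt )
   end program cgrid_layouts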
Ghost (Halo) Regions

      DO J = 1, N
         DO I = 1, M
            DATA( I,J ) = A( I+2,J ) * A( I,J-1 )
         END DO
      END DO

The stencil references A( I+2,J ) and A( I,J-1 ), which for cells near the edge of the local patch fall outside it, so the ghost (halo) region must be filled from neighboring processors before the loop runs.
Horizontal Advection and Diffusion Data Requirements
  (diagram: exterior boundary, ghost regions, and the stencil-exchange data communication function that fills them; see the sketch below)
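A minimal sketch of a stencil (ghost-region) exchange, assuming a 1-D decomposition over rows and a hypothetical 2-cell halo; the actual CMAQ stencil-exchange library is more general, so this only illustrates the communication pattern.

   program halo_sketch
      use mpi
      implicit none
      ! Hypothetical local patch: m columns by n rows, plus a 2-row halo on
      ! each side.  With a 1-D row decomposition the halo slabs are
      ! contiguous in memory.
      integer, parameter :: nghost = 2, m = 12, n = 8
      real    :: a( m, 1-nghost : n+nghost )
      integer :: ierr, rank, nprocs, south, north, stat( mpi_status_size )

      call mpi_init( ierr )
      call mpi_comm_rank( mpi_comm_world, rank, ierr )
      call mpi_comm_size( mpi_comm_world, nprocs, ierr )
      south = rank - 1
      north = rank + 1
      if ( south < 0 )       south = mpi_proc_null   ! exterior boundary:
      if ( north >= nprocs ) north = mpi_proc_null   ! no neighbor to exchange with

      a = real( rank )                               ! stand-in field values

      ! my top interior rows -> north neighbor's bottom halo;
      ! my bottom halo <- south neighbor's top interior rows
      call mpi_sendrecv( a( :, n-nghost+1 : n ), m*nghost, mpi_real, north, 1, &
                         a( :, 1-nghost   : 0 ), m*nghost, mpi_real, south, 1, &
                         mpi_comm_world, stat, ierr )
      ! my bottom interior rows -> south neighbor's top halo;
      ! my top halo <- north neighbor's bottom interior rows
      call mpi_sendrecv( a( :, 1   : nghost   ), m*nghost, mpi_real, south, 2, &
                         a( :, n+1 : n+nghost ), m*nghost, mpi_real, north, 2, &
                         mpi_comm_world, stat, ierr )

      call mpi_finalize( ierr )
   end program halo_sketch

At the exterior boundary MPI_PROC_NULL turns the send/receive into a no-op; there the ghost cells would instead hold boundary-condition data.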
Architectural Changes for Parallel I/O
  Tests confirmed the latest releases (May, Sep 2003) are not very scalable
  Background: Original Version / Latest Release / AQF Version
Parallel I/O
  2002 and 2003 Releases: read and computation, write and computation interleaved on the same processors
  AQF: computation only on the worker processors, write only on a dedicated writer; read, write and computation are asynchronous; data transfer by message passing
Modifications to the I/O API for Parallel I/O
For 3D data, e.g.:
  INTERP3 ( FileName, VarName, ProgName, Date, Time, Ncols*Nrows*Nlays, Data_Buffer )  – time interpolation, whole grid
  XTRACT3 ( FileName, VarName, StartLay, EndLay, StartRow, EndRow, StartCol, EndCol, Date, Time, Data_Buffer )  – spatial subset
  INTERPX ( FileName, VarName, ProgName, StartCol, EndCol, StartRow, EndRow, StartLay, EndLay, Date, Time, Data_Buffer )  – time interpolation + spatial subset (usage sketch below)
  WRPATCH ( FileID, VarID, TimeStamp, Record_No )  – called from PWRITE3 (pario)
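A hedged usage sketch of the subset call from a worker's point of view: each worker asks INTERPX for only its own patch of a met variable. The file and variable names and the patch bounds are placeholders, the file is assumed to have been opened elsewhere, and CMAQ's real error handling (M3EXIT) is omitted; this is not the CMAQ source.

   subroutine read_my_patch( startcol, endcol, startrow, endrow, nlays,   &
                             jdate, jtime, ta )
      implicit none
      integer, intent( in )  :: startcol, endcol, startrow, endrow, nlays
      integer, intent( in )  :: jdate, jtime     ! YYYYDDD, HHMMSS
      real,    intent( out ) :: ta( startcol:endcol, startrow:endrow, nlays )
      logical, external      :: interpx          ! I/O API subset time interpolation

      if ( .not. interpx( 'MET_CRO_3D', 'TA', 'read_my_patch',            &
                          startcol, endcol, startrow, endrow, 1, nlays,   &
                          jdate, jtime, ta ) ) then
         stop 'read_my_patch: could not interpolate patch of TA'
      end if
   end subroutine read_my_patch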
Standard (May 2003 Release)

DRIVER
  read ICs into CGRID
  begin output timestep loop
    advstep (determine sync timestep)
    couple
    begin sync timestep loop
      SCIPROC:
        X-Y-Z advect, adjadv, hdiff
        decouple
        vdiff (DRYDEP), cloud (WETDEP), gas chem (aero, VIS)
        couple
    end sync timestep loop
    decouple
    write conc and avg conc (CONC, ACONC)
  end output timestep loop
AQF CMAQ

DRIVER
  set WORKERs, WRITER
  if WORKER, read ICs into CGRID
  begin output timestep loop
    if WORKER
      advstep (determine sync timestep)
      couple
      begin sync timestep loop
        SCIPROC:
          X-Y-Z advect, hdiff
          decouple
          vdiff, cloud, gas chem (aero)
          couple
      end sync timestep loop
      decouple
      MPI send conc, aconc, drydep, wetdep, (vis)
    if WRITER
      completion-wait: for conc, write conc (CONC); for aconc, write aconc; etc.
  end output timestep loop
  (see the worker/writer sketch below)
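A minimal sketch of the worker/writer split, assuming the last MPI rank is the dedicated writer and all other ranks are workers; the patch size, tag, and the "write" itself are placeholders rather than the AQF driver code.

   program worker_writer_sketch
      use mpi
      implicit none
      ! Run with at least 2 MPI ranks: the last rank only receives and
      ! "writes"; the other ranks only compute and send.
      integer, parameter :: patch_size = 1000       ! hypothetical patch length
      real    :: conc( patch_size )
      integer :: ierr, rank, nprocs, writer, w, stat( mpi_status_size )

      call mpi_init( ierr )
      call mpi_comm_rank( mpi_comm_world, rank, ierr )
      call mpi_comm_size( mpi_comm_world, nprocs, ierr )
      writer = nprocs - 1

      if ( rank /= writer ) then
         ! WORKER: stand-in for the science timestep loop, then ship the
         ! output patch to the writer and go on computing
         conc = real( rank )
         call mpi_send( conc, patch_size, mpi_real, writer, 100, &
                        mpi_comm_world, ierr )
      else
         ! WRITER: collect one patch from every worker, then write to disk
         ! (the write is just a print in this sketch)
         do w = 0, nprocs - 2
            call mpi_recv( conc, patch_size, mpi_real, w, 100, &
                           mpi_comm_world, stat, ierr )
            print *, 'writer received conc patch from worker', w
         end do
      end if

      call mpi_finalize( ierr )
   end program worker_writer_sketch

The point of the design is that once a worker's send of its output patch completes, the worker can proceed with the next output timestep while the writer is still assembling and writing the previous one.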
Power3 Cluster (NESCC's LPAR)
  The platform (cypress00, cypress01, cypress02) consists of 3 SP Nighthawk nodes (shared memory within a node, connected by a switch)
  All CPUs share user applications with file servers, interactive use, etc.
  ~2X slower than NCEP's
Power4 p690 Servers (NCEP's LPAR)
  Each platform (snow, frost) is composed of 22 p690 (Regatta) servers (shared memory within a node, connected by a switch)
  Each server has 32 CPUs, LPAR-ed into 8 nodes per server (4 CPUs per node)
  Some nodes are dedicated to file servers, interactive use, etc.; effectively 20 servers for general use (160 nodes, 640 CPUs)
  2 "Colony" switch connectors per node
  ~2X the performance of cypress nodes
"ice" Beowulf Cluster
  Pentium 3, 1.4 GHz nodes on an internal network
  Isolated from outside network traffic
"global" MPICH Cluster
  Pentium 4 Xeon, 2.4 GHz nodes (global, global1, global2, global3, global4, global5) connected by a network
RESULTS
  5-hour (12Z–17Z) and 24-hour (12Z–12Z) runs
  20 Sept 2002 test data set used for developing the AQF-CMAQ
  Input met from ETA, processed through PRDGEN and PREMAQ
  Domain (shown on the following slides): 166 columns x 142 rows x 22 layers at 12-km resolution
  CB4 mechanism, no aerosols
  Pleim's Yamartino advection for AQF-CMAQ; PPM advection for the May 2003 Release
The Matrix
  run  = comparison of run times
  wall = comparison of relative wall times for main science processes
  Processor configurations: 4 = 2 x 2, 8 = 4 x 2, 16 = 4 x 4, 32 = 8 x 4, 64 = 8 x 8, 64 = 16 x 4 (see the decomposition sketch below)
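A hedged sketch of how one of these processor grids carves the 166-column by 142-row domain into patches, taking 32 = 8 x 4 as 8 patches across the columns by 4 across the rows (an assumption about orientation) and using the common even-split-plus-remainder rule; an illustration, not the actual pario decomposition code.

   program decompose_sketch
      implicit none
      integer, parameter :: gl_ncols = 166, gl_nrows = 142
      integer :: npcol, nprow, pc, pr, c0, c1, r0, r1

      npcol = 8
      nprow = 4                                  ! the 32 = 8 x 4 case
      do pr = 0, nprow - 1
         do pc = 0, npcol - 1
            call patch_range( gl_ncols, npcol, pc, c0, c1 )
            call patch_range( gl_nrows, nprow, pr, r0, r1 )
            print '(a,2i3,a,2i4,a,2i4)', 'proc', pc, pr, '  cols', c0, c1, '  rows', r0, r1
         end do
      end do

   contains

      subroutine patch_range( nglobal, nprocs, p, lo, hi )
         integer, intent( in )  :: nglobal, nprocs, p
         integer, intent( out ) :: lo, hi
         integer :: base, extra
         base  = nglobal / nprocs                ! e.g. 166 / 8 = 20
         extra = mod( nglobal, nprocs )          ! e.g. 166 - 8*20 = 6
         lo = p*base + min( p, extra ) + 1       ! low-index procs get the extras
         hi = lo + base - 1
         if ( p < extra ) hi = hi + 1
      end subroutine patch_range

   end program decompose_sketch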
24-Hour AQF vs Release (AQF-CMAQ and 2003 Release side by side; data shown at peak hour)
24-Hour AQF vs Release
  Less than 0.2 ppb difference between AQF on cypress and snow
  Almost 9 ppb max difference between Yamartino and PPM
Absolute Run Times (sec), 24 Hours, 8 Worker Processors: AQF vs Release
Absolute Run Times (sec), 24 Hours, 32 Worker Processors: AQF vs Release on cypress, and AQF on snow
Relative Run Times (% of slowest), 5 Hours, 8 Worker Processors: AQF-CMAQ on various platforms
Relative Run Times (% of slowest), 5 Hours: AQF-CMAQ, cypress vs. snow
Relative Run Times (% of slowest) vs. number of worker processors, 5 Hours: AQF-CMAQ on various platforms
Relative Wall Times (%), 5 Hours: AQF-CMAQ on cypress
Relative Wall Times (%), 5 Hours: AQF-CMAQ on snow
Relative Wall Times (%), 24 hr, 8 Worker Processors: AQF vs Release on cypress (PPM vs. Yamo advection)
Relative Wall Times (%), 24 hr, 8 Worker Processors: AQF vs Release on cypress, with snow added for 8 and 32 processors
Relative Wall Times (%), 24 hr, 8 Worker Processors: AQF horizontal advection (x-r-l = x row loop, y-c-l = y column loop, hppm = the kernel solver called in each loop)
Relative Wall Times (%), 24 hr, 32 Worker Processors: AQF horizontal advection
Relative Wall Times (%), 24 hr, 32 Worker Processors: AQF & Release horizontal advection (the r- prefix denotes the Release)
Future Work
  Add aerosols back in
  TKE vdiff
  Improve horizontal advection/diffusion scalability
  Some I/O improvements
  Layer-variable horizontal advection time steps
Aspect Ratios