Petascale
–LLNL Appro AMD: 9K processors [today]
–TJ Watson Blue Gene/L: 40K processors [today]
–NY Blue Gene/L: 32K processors
–ORNL Cray XT3/4: 44K processors [Jan 2008]
–TACC Sun: 55K processors [Jan 2008]
–ANL Blue Gene/P: 160K processors [Jan 2008]
CCSM and Component Models
–POP (Ocean)
–CICE (Sea Ice)
–CLM (Land Model)
–CPL (Coupler)
–CAM (Atmosphere)
–CCSM
Status of POP (John Dennis)
–17K Cray XT4 processors [12.5 years/day]
–29K IBM Blue Gene/L [8.5 years/day] (BG Ready in Expedition Mode)
–Parallel I/O [Underway]
–Land causes load imbalance at 0.1-degree resolution
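The two POP throughput figures above can be put on a common footing with a little arithmetic. The sketch below is illustrative only: it assumes "17K" and "29K" mean 17,000 and 29,000 processors, that "years/day" means simulated years per wall-clock day, and the 100-year target is a hypothetical example, not something stated on the slide.

```python
# Throughput figures quoted on the POP status slide.
# Assumption: "17K" / "29K" are read as 17,000 / 29,000 processors.
runs = {
    "Cray XT4": {"processors": 17_000, "years_per_day": 12.5},
    "IBM Blue Gene/L": {"processors": 29_000, "years_per_day": 8.5},
}


def wallclock_days(simulated_years, years_per_day):
    """Wall-clock days needed to simulate the given number of years."""
    return simulated_years / years_per_day


def rate_per_thousand(processors, years_per_day):
    """Simulated years per wall-clock day, per 1,000 processors."""
    return years_per_day / (processors / 1000)


# Hypothetical target: a 100-year simulation.
summary = {
    name: {
        "days_for_century": wallclock_days(100, r["years_per_day"]),
        "years_per_day_per_1k": rate_per_thousand(
            r["processors"], r["years_per_day"]
        ),
    }
    for name, r in runs.items()
}

for name, s in summary.items():
    print(
        f"{name}: {s['days_for_century']:.1f} days per century, "
        f"{s['years_per_day_per_1k']:.2f} years/day per 1K processors"
    )
```

Under these assumptions, the per-processor rate on the XT4 run is roughly 2.5 times that of the Blue Gene/L run. The slide itself does not make this comparison.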
Status of CAM (John Dennis)
–CAM HOMME in Expedition Mode
–Standard CAM “may be” run at 1 degree resolution or slightly higher on BG
Simulation rate for HOMME: Held-Suarez [figure; labels: 1/2, 1/3, 1/4]
CAM & CCSM
–BG/L Expedition not from climate scientists
–Parallel I/O is the biggest bottleneck
Cloud Resolving Models/LES
–Active Tracer High-resolution Atmospheric Model (ATHAM):
  –modularized
  –parallel-ready (MPI)
–Goddard Cloud Ensemble Model (GCE):
  –well-established (1970s to present)
  –parallel-ready (MPI)
  –scales linearly (99% up to 256 tasks)
  –comprehensive
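The slide does not define the "99% up to 256 tasks" figure for GCE. One common reading, assumed here, is parallel efficiency: measured speedup divided by the ideal (linear) speedup. The sketch below shows that definition and the speedup the quoted number would imply under that assumption. The function names are illustrative and no timings from the actual model are used.

```python
def parallel_efficiency(t_ref, p_ref, t_p, p):
    """Efficiency of a run on p tasks relative to a reference run.

    t_ref: wall-clock time of the reference run on p_ref tasks
    t_p:   wall-clock time of the run on p tasks
    A value of 1.0 means perfectly linear scaling.
    """
    return (t_ref * p_ref) / (t_p * p)


def implied_speedup(efficiency, tasks):
    """Speedup over a single task implied by a given parallel efficiency."""
    return efficiency * tasks


# Assumption: the slide's "99%" is parallel efficiency at 256 tasks.
gce_speedup = implied_speedup(0.99, 256)
print(f"Implied speedup at 256 tasks: {gce_speedup:.1f}x")
```

Read this way, 99% efficiency at 256 tasks corresponds to a speedup of about 253 over a single task.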
Implementations
–Done (NERSC IBM SP, GSFC):
  –ATHAM: 2D & 3D bulk cloud physics
  –GCE: 3D bulk cloud physics; 2D size-bins cloud physics
–In progress and planned (Blue Gene):
  –GCE (ATHAM): 3D size-bins cloud physics
  –larger domain
  –longer simulation period
  –finer resolution
  –…
From: John Michalakes, NCAR
Parallelism in WRF: Multi-level Decomposition
–Single version of code for efficient execution on:
  –Distributed-memory
  –Shared-memory
  –Clusters of SMPs
  –Vector and microprocessors
–Model domains are decomposed for parallelism on two levels:
  –Patch: section of model domain allocated to a distributed-memory node
  –Tile: section of a patch allocated to a shared-memory processor within a node; this is also the scope of a model layer subroutine
–Distributed-memory parallelism is over patches; shared-memory parallelism is over tiles within patches
[Figure labels: Logical domain; 1 Patch, divided into multiple tiles; Inter-processor communication]
Slide Courtesy: NCAR
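The patch/tile decomposition described above can be sketched in a few lines. This is a minimal illustration, not WRF code: the function names, the grid size, and the choice to cut each patch into tiles along the y direction only are all assumptions made for the example.

```python
def split_range(start, stop, parts):
    """Split the half-open range [start, stop) into `parts` contiguous
    chunks whose sizes differ by at most one."""
    base, extra = divmod(stop - start, parts)
    bounds = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def decompose(nx, ny, patches_x, patches_y, tiles_per_patch):
    """Two-level decomposition of an nx-by-ny logical domain.

    Level 1: patches, one per distributed-memory node.
    Level 2: tiles, one per shared-memory thread within the node.
    Bounds are (x0, x1, y0, y1), half-open in both directions.
    Assumption: tiles are strips cut along y within each patch.
    """
    layout = {}
    patch_id = 0
    for y0, y1 in split_range(0, ny, patches_y):
        for x0, x1 in split_range(0, nx, patches_x):
            tiles = [
                (x0, x1, ty0, ty1)
                for ty0, ty1 in split_range(y0, y1, tiles_per_patch)
            ]
            layout[patch_id] = {"bounds": (x0, x1, y0, y1), "tiles": tiles}
            patch_id += 1
    return layout


# Hypothetical example: 100 x 60 grid, 2 x 2 patches, 3 tiles per patch.
layout = decompose(nx=100, ny=60, patches_x=2, patches_y=2, tiles_per_patch=3)
for pid, patch in layout.items():
    print(pid, patch["bounds"], patch["tiles"])
```

In a real model, each patch would also carry halo cells exchanged with neighbouring patches (the inter-processor communication in the figure). The sketch leaves that out to keep the two levels visible.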
NCAR WRF Issues with Blue Gene/L (from John Michalakes)
–Relatively slow I/O
–Limited memory per node
–Relatively poor processor performance
–“Lots of little gotchas mostly related to immaturity, especially in the programming environment.”