
1 CCSM4 - A Flexible New Infrastructure for Earth System Modeling
Mariana Vertenstein, NCAR CCSM Software Engineering Group

2 Major Infrastructure Changes since CCSM3
CCSM4/CPL7 development could not have occurred without the following collaborators:
– DOE/SciDAC: Oak Ridge National Laboratory (ORNL), Argonne National Laboratory (ANL), Los Alamos National Laboratory (LANL), Lawrence Livermore National Laboratory (LLNL)
– NCAR/CISL
– ESMF

3 Outline
What are the software requirements of a community earth system model?
Overview of the current CCSM4
How does CCSM4 address these requirements?
– Flexibility permits greater efficiency, throughput, ease of porting, and ease of model development
How is CCSM4 being used in new ways?
– Interactive ensembles - extending the traditional definition of a component
– Extending CCSM to ultra-high resolutions
What is CCSM4 scalability and performance?
Upcoming releases and new CCSM4 scripts

4 CESM General Software Requirements
Scientific consistency: one code base - the "stand-alone" development component code base is the same as in the fully coupled system
User-friendly component parameterization: the model system permits each component to be developed and tested independently, even on one processor (e.g. CAM/SCAM)
Extensibility: the design provides extensibility to add new components (e.g. land-ice) and new coupling strategies (interactive ensembles, data assimilation capability)
Performance/efficiency/porting: the coupling architecture and components can be easily ported and run effectively at low resolution (e.g. paleo) and at ultra-high resolution on thousands of PEs

5 Specific High Resolution Requirements
Capability to use both MPI and OpenMP effectively to address the requirements of new multi-core architectures
Scalable and flexible coupling infrastructure
Parallel I/O throughout the model system (for both scalable memory and performance)
Scalable memory (minimum global arrays) for each component

6 CCSM4 Overview
Consists of a set of 4 (5 for CESM) geophysical component models, on potentially different grids, that exchange boundary data with each other only via communication with a coupler (hub-and-spoke architecture)
– New science is resulting in a sharply increasing number of fields being communicated between components
Large code base: >1M lines
– Fortran 90 (mostly)
– Developed over 20+ years
– 200-300K lines are critically important, with no small set of computational kernels, so good compilers are needed
Collaborations are critical
– DOE/SciDAC, university community, NSF (PetaApps), ESMF

7 What are the CCSM Components?
Atmosphere component: CAM, DATM, (WRF). CAM modes: multiple dycores, multiple chemistry options, WACCM, single-column CAM. Data-ATM: multiple forcing/physics modes
Land component: CLM, DLND, (VIC). CLM modes: no BGC, BGC, dynamic vegetation, BGC-DV, prescribed vegetation, urban. Data-LND: multiple forcing/physics modes
Ice component: CICE, DICE. CICE modes: fully prognostic, prescribed. Data-ICE: multiple forcing/physics modes
Ocean component: POP, DOCN (SOM/DOM), (ROMS). POP modes: ecosystem, fully-coupled, ocean-only, multiple physics options. Data-OCN: multiple forcing/physics modes (SOM/DOM)
New land-ice component
Coupler: regridding, merging, calculation of ATM/OCN fluxes, conservation diagnostics

8 CCSM Component Grids
Ocean and sea-ice must run on the same grid
– displaced pole, tripole
Atmosphere and land can now run on different grids - these are in general different from the ocean/ice grid
– lat/lon, but also the new cubed sphere for CAM
Global grids span low resolution (3 degree) to ultra-high resolution
– 0.25° ATM/LND [1152 x 768]
– 0.50° ATM/LND [576 x 384]
– 0.1° OCN/ICE [3600 x 2400]
Regridding
– Done in parallel at runtime using mapping files that are generated offline using SCRIP
– In the past, grids have been global and logically rectangular - but now they can be single point, regional, cubed sphere, ...
– Regridding issues are rapidly becoming a higher priority

9 CCSM Component Parallelism
MPI/OpenMP
– CAM, CLM, CICE, and POP have hybrid MPI/OpenMP capability
– The coupler has only MPI capability
– The data models have only MPI capability
Parallel I/O (use of the PIO library)
– CAM, CICE, POP, CPL, and the data models all have PIO capability

10 New CCSM4 Architecture
[Diagram: processor-versus-time layouts] The original multiple-executable CCSM3 architecture (cpl6) ran CAM, CLM, CICE, POP, and CPL as separate concurrent executables. The new single-executable CCSM4 architecture (cpl7) has a driver that controls the time evolution and a CPL component that handles regridding and merging, and it supports both a fully sequential layout and hybrid sequential/concurrent layouts of CAM, CLM, CICE, and POP.

11 Advantages of the CPL7 Design
New flexible coupling strategy
– The design targets a wide range of architectures - massively parallel peta-scale hardware, smaller Linux clusters, and even single laptop computers
– Provides efficient support of varying levels of parallelism via simple run-time configuration of the processor layout
– New CCSM4 scripts provide one simple xml file to specify the processor layout of the entire system, plus automated timing information to simplify load balancing (an illustrative sketch follows below)
Scientific unification
– ALL model development is done with one code base - elimination of the separate stand-alone component code bases (CAM, CLM)
Code reuse and maintainability
– Lowers the cost of support and maintenance
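To make that concrete, here is a minimal sketch of such a processor-layout file. It is modeled on the env_mach_pes.xml entries used by the CCSM4 scripts, with task counts borrowed from the Cray XT layout shown later in this talk; the exact entry names and values should be taken from the released scripts, not from this sketch.

    <!-- Illustrative processor layout: MPI task count, OpenMP thread count,
         and root PE for each component (entry names reproduced from memory) -->
    <entry id="NTASKS_ATM" value="1664" />
    <entry id="NTHRDS_ATM" value="1" />
    <entry id="ROOTPE_ATM" value="0" />
    <entry id="NTASKS_LND" value="1664" />
    <entry id="ROOTPE_LND" value="0" />
    <entry id="NTASKS_ICE" value="1800" />
    <entry id="ROOTPE_ICE" value="0" />
    <entry id="NTASKS_CPL" value="1800" />
    <entry id="ROOTPE_CPL" value="0" />
    <entry id="NTASKS_OCN" value="4028" />   <!-- POP on its own processor range, -->
    <entry id="ROOTPE_OCN" value="1816" />   <!-- running concurrently with the others -->

Because components share or separate processor ranges purely through these run-time entries, moving between sequential and hybrid sequential/concurrent layouts requires no source-code changes.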

12 More CPL7 advantages...
Simplicity
– Easier to debug - the time flow is much easier to understand
– Easier to port - already ported to IBM p6 (NCAR), Cray XT4/XT5 (NICS, ORNL, NERSC), BG/P (Argonne), BG/L (LLNL), and Linux clusters (NCAR, NERSC, CCSM4-alpha users)
– Easier to run - new xml-based scripts provide a user-friendly capability to create "out-of-box" experiments
Performance (throughput and efficiency)
– Much greater flexibility to achieve an optimal load balance for different choices of resolution, component combinations, and component physics
– Automatically generated timing tables provide users with immediate feedback on both performance and efficiency

13 CCSM4 Provides a Seamless End-to-End Cycle of Model Development, Integration, and Prediction with One Unified Model Code Base

14 New frontiers for CCSM
Using the coupling infrastructure in novel ways
– Implementation of interactive ensembles
Pushing the limits of high resolution
– Capability to really exercise the scalability and performance of the system

15 CCSM4 and PetaApps
CCSM4/CPL7 is an integral piece of an NSF PetaApps award
– A funded 3-year effort aimed at advancing climate science capability for petascale systems
– NCAR, COLA, NERSC, U. Miami
Interactive ensembles using CCSM4/CPL7
– Involve both computational and scientific challenges
– Used to understand how oceanic, sea-ice, and atmospheric noise impacts climate variability
– Can also scale out to tens of thousands of processors
Also examining the use of a PGAS language in CCSM

16 Interactive Ensembles and CPL7
[Diagram: ensemble processor/time layouts of CAM, CLM, CICE, POP, and CPL under the driver]
All ensemble members run concurrently on non-overlapping processor sets
Communication with the coupler takes place serially over ensemble members
Setting a new number of ensemble members requires editing one line of an xml file (an illustrative line follows below)
35M CPU hours on TeraGrid [2nd largest allocation]
Currently being used to perform ocean data assimilation (using DART) for POP2
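Purely as an illustration of that one-line change - the actual variable name is defined by the experiment scripts and the one below is hypothetical:

    <!-- hypothetical entry: number of concurrent ensemble instances -->
    <entry id="NINST" value="10" />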

17 CCSM4 and Ultra High Resolution
DOE/LLNL Grand Challenge simulation
– 0.25° atmosphere/land and 0.1° ocean/ice
– Multi-institutional collaboration (ANL, LANL, LLNL, NCAR, ORNL)
– First ever U.S. multi-decadal global climate simulation with an eddy-resolving ocean and a high-resolution atmosphere
– 0.42 sypd on 4048 cpus (Atlas LLNL cluster)
– 20 years completed
– 100 GB per simulated month

18 Ultra High Resolution (cont.)
NSF/PetaApps control simulation (IE baseline)
– John Dennis (CISL) has carried this out
– 0.5° atmosphere/land and 0.1° ocean/ice
– Control run in production at NICS (TeraGrid)
– 1.9 sypd on 5848 quad-core XT5 cpus (4-5 months of continuous simulation)
– 155 years completed
– 100 TB of data generated (0.5-1 TB per wall-clock day)
– 18M CPU hours used
– Output transferred from NICS to NCAR (100-180 MB/sec sustained) and archived on HPSS
– Data analysis uses 55 TB of project space at NCAR

19 Next steps at high resolution
Future work
– Use the OpenMP capability in all components effectively to take advantage of multi-core architectures (Cray XT5 hex-core and BG/P)
– Improve disk I/O performance [currently 10-25% of run time]
– Improve memory footprint scalability
Future simulations
– 0.25° atm / 0.1° ocean
– T341 atm / 0.1° ocean (effect of the Eulerian dycore)
– 1/8° atm (HOMME) / 0.25° land / 0.1° ocean

20 CCSM4 Scalability and Performance

21 New Parallel I/O library (PIO)
PIO is an interface between the model and the I/O library. It supports:
– Binary
– NetCDF3 (serial netcdf)
– Parallel NetCDF (pnetcdf, MPI-IO)
– NetCDF4
The user has enormous flexibility to choose what works best for their needs
– Can read one format and write another
Rearranges data from the model decomposition to an I/O-friendly decomposition (the rearranger is framework independent)
– model tasks and I/O tasks can be independent
A minimal usage sketch follows below.
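For orientation only, here is a sketch of how a component might write a distributed array through PIO. The call names follow PIO's Fortran 90 interface, but the argument lists are abbreviated from memory and the decomposition is deliberately trivial, so treat it as a schematic of the workflow (initialize an I/O subsystem, describe the decomposition, write collectively) rather than a verbatim example.

    program pio_write_sketch
      ! Schematic PIO usage: initialize an I/O subsystem on a subset of MPI
      ! tasks, describe the mapping from the model decomposition to the file,
      ! and write a distributed array with one collective call.
      use mpi
      use pio
      implicit none
      type(iosystem_desc_t) :: iosys     ! the PIO I/O subsystem
      type(file_desc_t)     :: file
      type(io_desc_t)       :: iodesc    ! model-decomp -> file-decomp mapping
      type(var_desc_t)      :: vdesc
      integer :: ierr, rank, nprocs, dimid, i, nx_local
      integer, parameter :: nx_global = 1024
      integer, allocatable :: compdof(:)        ! global indices owned locally
      real(kind=8), allocatable :: field(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! simple 1-d block decomposition of the global array across all tasks
      nx_local = nx_global / nprocs
      allocate(compdof(nx_local), field(nx_local))
      do i = 1, nx_local
         compdof(i) = rank*nx_local + i
         field(i)   = real(compdof(i), kind=8)
      end do

      ! one I/O task for every 4 compute tasks (stride 4), box rearranger
      call PIO_init(rank, MPI_COMM_WORLD, max(1, nprocs/4), 0, 4, &
                    PIO_rearr_box, iosys)

      ierr = PIO_createfile(iosys, file, PIO_iotype_netcdf, 'sketch.nc', PIO_clobber)
      call PIO_initdecomp(iosys, PIO_double, (/ nx_global /), compdof, iodesc)
      ierr = PIO_def_dim(file, 'x', nx_global, dimid)
      ierr = PIO_def_var(file, 'field', PIO_double, (/ dimid /), vdesc)
      ierr = PIO_enddef(file)

      ! collective write: PIO rearranges from the model decomposition onto the
      ! I/O tasks and writes through the backend selected above (netcdf here)
      call PIO_write_darray(file, vdesc, iodesc, field, ierr)

      call PIO_closefile(file)
      call PIO_finalize(iosys, ierr)
      call MPI_Finalize(ierr)
    end program pio_write_sketch

In a sketch like this, changing the iotype argument (e.g. to a pnetcdf or netcdf4 iotype) is what switches the backend, which is the format flexibility described above.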

22 PIO in CCSM
PIO is implemented in CAM, CICE, and POP
Its use is critical for high-resolution, high processor-count simulations
– Serial I/O is one of the largest sources of global memory in CCSM - it will eventually always run out of memory
– Serial I/O incurs a serious performance penalty at higher processor counts
A performance benefit is seen even with serial netcdf (model output is decomposed across the output I/O tasks)

23 CPL scalability
Scales much better than the previous version - in both memory and throughput
Coupling inherently involves a lot of communication relative to flops
The new coupler has not been a bottleneck in any configuration we have tested so far - other issues, such as load balance and the scaling of other processes, have dominated
Minor impact at 1800 cores (kraken PetaApps control)

24 CCSM4 Cray XT Scalability (courtesy of John Dennis)
[Processor/time layout diagram: POP on 4028 cores, CAM on 1664, CICE on 1800, CPL on 1800]
1.9 sypd on 5844 cores, with I/O, on the kraken quad-core XT5

25 CAM/HOMME Dycore
The cubed-sphere grid overcomes the dynamical core scalability problems inherent in the lat/lon grid
Work of Mark Taylor (SciDAC), Jim Edwards (IBM), and Brian Eaton (CSEG)
The PIO library is used for all I/O (this work COULD NOT have been done without PIO)
BG/P (4 cores/node): excellent scalability down to 1 element per processor (86,200 processors at 0.25 degree resolution)
JaguarPF (12 cores/node): 2-3x faster per core than BG/P, but scaling is not as good - the 1/8 degree run loses scalability at 4 elements per processor

26 CAM/HOMME Real Planet: 1/8° Simulations
CCSM4 - CAM4 physics configuration with cyclical year-2000 ocean forcing data sets
– CAM-HOMME 1/8°, 86400 cores
– CLM2 on a 1/4° lat/lon grid, 512 cores
– Data ocean/ice, 1°, 512 cores
– Coupler, 8640 cores
JaguarPF simulation
– Excellent scalability: the 1/8 degree configuration runs at 3 SYPD on Jaguar
– Large-scale features agree well with the Eulerian and FV dycores
These runs confirm that the scalability of the dynamical core is preserved by CAM, and that the scalability of CAM is preserved by the CCSM real-planet configuration.

27 How will CCSM4 be released?
- Leverages the Subversion revision control system
- Source code and input data are obtained from Subversion servers (not tar files)
- Output data of control runs is available from the ESG
- Advantages:
  - Easier for CSEG to produce frequent updates
  - Flexible way for users to obtain new updates of the source code (and bug fixes)
  - Users can leverage Subversion to merge new updates into their "sandbox" along with their own modifications

28 Obtaining the Code and Updates
Subversion source code repository (public): https://svn-ccsm-release.cgd.ucar.edu
– svn co: obtain the ccsm4.0 code
– make your own modifications in your sandbox
– svn merge: obtain new code updates and bug fixes, which Subversion merges with your own changes
An illustrative command sequence follows below.
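A minimal sketch of that cycle, assuming a hypothetical tag path beneath the public repository URL above (the real tag names are listed on the release page):

    # check out the ccsm4.0 release into a local sandbox
    # (the model_versions/ccsm4_0 path below is illustrative)
    svn co https://svn-ccsm-release.cgd.ucar.edu/model_versions/ccsm4_0 ccsm4
    cd ccsm4
    # ... make your own modifications inside the sandbox ...
    # later, pick up an update/bug-fix tag, letting Subversion merge it
    # with your local changes (resolve any conflicts it reports)
    svn merge https://svn-ccsm-release.cgd.ucar.edu/model_versions/ccsm4_0 \
              https://svn-ccsm-release.cgd.ucar.edu/model_versions/ccsm4_0_u01 .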

29 Creating an Experimental Case
New CCSM4 scripts simplify:
– Porting CCSM4 to your machine
– Creating your experiment and obtaining the necessary input data for it
– Load balancing your experiment
– Debugging your experiment - if something goes wrong during the simulation (which of course never happens), it is simpler to determine what went wrong

30 Porting to your machine
The CCSM4 scripts contain a set of supported machines
– users can run out of the box
The CCSM4 scripts also support a set of "generic" machines (e.g. Linux clusters with a variety of compilers)
– the user still needs to determine which generic machine most closely resembles their own and customize the Makefile macros for their machine (an illustrative fragment follows below)
– user feedback will be leveraged to continuously upgrade the generic machine capability post-release
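For orientation, a hypothetical fragment of the kind of Makefile macros a generic-machine port typically has to adjust; the variable names, flags, and paths below are illustrative, not the exact macros shipped with the release scripts:

    # hypothetical Macros fragment for a generic Linux cluster
    FC          := mpif90                    # Fortran 90 compiler wrapper
    CC          := mpicc
    FFLAGS      := -O2 -fconvert=big-endian  # compiler-specific flags
    NETCDF_PATH := /usr/local/netcdf
    INC_NETCDF  := $(NETCDF_PATH)/include
    LIB_NETCDF  := $(NETCDF_PATH)/lib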

31 Obtaining Input Data
Input data is now kept in a Subversion repository
The entire input data set is about 900 GB and growing
The CCSM4 scripts let users automatically obtain only the input data needed for a given experimental configuration

32 Accessing input data for your experiment
Subversion input data repository (public): https://svn-ccsm-inputdata.cgd.ucar.edu
Set up the experiment:
– create_newcase (component set, resolution, machine)
– determine the local root directory where all input data will go (DIN_LOC_ROOT)
– use check_input_data to see if the required datasets are present in DIN_LOC_ROOT
– use check_input_data with the export option to automatically obtain ONLY the required datasets for the experiment into DIN_LOC_ROOT
– load balance your experimental configuration (use the timing files)
– run the experiment
An illustrative command sequence follows below.
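A minimal sketch of that flow, with hypothetical case, compset, resolution, and machine names; the option spellings follow the CCSM4 scripts as best we recall them and may differ slightly in the released version:

    # create a case (the names after each flag are illustrative)
    ./create_newcase -case ~/cases/b40.test -compset B -res f19_g16 -mach generic_linux
    cd ~/cases/b40.test
    ./configure -case
    # report which required input datasets are already under DIN_LOC_ROOT ...
    ./check_input_data
    # ... then export only the missing ones from the inputdata repository
    ./check_input_data -export
    # load balance using the timing files, then build and run
    ./b40.test.generic_linux.build
    ./b40.test.generic_linux.run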

33 Load Balancing Your Experiment
A load balancing exercise must be done before starting an experiment
– Repeat short runs (20 days) without I/O and adjust the processor layout to
  – optimize throughput
  – minimize idle time (maximize efficiency)
Detailed timing results are produced with each run
– This makes the load balancing exercise much simpler than in CCSM3

34 Load Balancing CCSM Example
[Figure: two processor/time layouts for CAM, CLM, CICE, POP, and CPL7]
– Initial layout: 3136 cores, 1.53 SYPD, with noticeable idle time on some processor sets
– After increasing the core count for POP: 2.23 SYPD, with reduced idle time
(core counts labeled in the figure: 1664 and 4028)

35 CCSM4 Releases and Timelines
January 15, 2010: CCSM4.0 alpha release - to a subset of users and vendors, with minimal documentation (except for the scripts User's Guide)
April 1, 2010: CCSM4.0 release - full documentation, including a User's Guide, Model Reference Documents, and experimental data
June 1, 2010: CESM1.0 release - ocean ecosystem, CAM-AP, interactive chemistry, WACCM
A new CCSM output data web design is underway (including comprehensive diagnostics)

36 CCSM4.0 alpha release
An extensive CCSM4 User's Guide is already in place
Apply for alpha user access at www.ccsm.ucar.edu/models/ccsm4.0

37 Upcoming Challenges
This year
– Carry out IPCC simulations
– Release CCSM4 and CESM1, plus updates
– Resolve performance and memory issues with the ultra-high resolution configuration on Cray XT5 and BG/P
– Create a user-friendly validation process for porting to new machines
On the horizon
– Support regional grids
– Nested regional modeling in CPL7
– Migration to, and optimization for, GPUs

38 Big Interdisciplinary Team!
Contributors: D. Bader (ORNL), D. Bailey (NCAR), C. Bitz (U Washington), F. Bryan (NCAR), T. Craig (NCAR), A. St. Cyr (NCAR), J. Dennis (NCAR), B. Eaton (NCAR), J. Edwards (IBM), B. Fox-Kemper (MIT, CU), N. Hearn (NCAR), E. Hunke (LANL), B. Kauffman (NCAR), E. Kluzek (NCAR), B. Kadlec (CU), D. Ivanova (LLNL), E. Jedlicka (ANL), E. Jessup (CU), R. Jacob (ANL), P. Jones (LANL), J. Kinter (COLA), A. Mai (NCAR), S. Mishra (NCAR), S. Peacock (NCAR), K. Lindsay (NCAR), W. Lipscomb (LANL), R. Loft (NCAR), R. Loy (ANL), J. Michalakes (NCAR), A. Mirin (LLNL), M. Maltrud (LANL), J. McClean (LLNL), R. Nair (NCAR), M. Norman (NCSU), N. Norton (NCAR), T. Qian (NCAR), M. Rothstein (NCAR), C. Stan (COLA), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), J. Wolfe (NCAR), P. Worley (ORNL), M. Zhang (SUNYSB)
Funding:
– DOE-BER CCPP Program: grants DE-FC03-97ER62402, DE-PS02-07ER07-06, DE-FC02-07ER64340; B&R KP1206000
– DOE-ASCR: B&R KJ0101030
– NSF Cooperative Grant NSF01
– NSF PetaApps Award
Computer time:
– Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson), LLNL, Stony Brook & BNL
– Cray XT time: NICS/ORNL, NERSC, Sandia

39 Thanks! Questions?
CCSM4.0 alpha release page: www.ccsm.ucar.edu/models/ccsm4.0

