Published by Matilda Bennett. Modified over 9 years ago.
1
Running CCSM
Tony Craig
CCSM Software Engineering Group
ccsm@ucar.edu
www.ccsm.ucar.edu
2
Outline
General review of CCSM
Setting up and running a simple case
Datasets
Production
Modifying source code
Errors
Tools
Performance
3
Review of CCSM
Five components / ten models
  –Atmosphere (3) : atm, datm, latm
  –Ocean (2) : ocn, docn
  –Land (2) : lnd, dlnd
  –Ice (2+) : ice, ice (prescribed mode), ice (mixed layer ocean mode), dice
  –Coupler (1) : cpl
Communication via MPI between components and coupler only
Each component runs on multiple processors via MPI, OpenMP, or MPI/OpenMP
4
Component parallelization
atm : MPI, OpenMP, or MPI/OpenMP
lnd : MPI, OpenMP, or MPI/OpenMP
ice : MPI only
ocn : MPI only
cpl : OpenMP only
Data models (datm, docn, dice, dlnd, latm) : serial only, 1 processor
5
Configurations
A = datm, dlnd, docn, dice, cpl
B = atm, lnd, ocn, ice, cpl
C = datm, dlnd, ocn, dice, cpl
D = datm, dlnd, docn, ice, cpl
F = atm, lnd, docn, ice (prescribed mode), cpl
G = latm, dlnd, ocn, ice, cpl
H = atm, dlnd, docn, dice, cpl
I = datm, lnd, docn, dice, cpl
K = atm, lnd, docn, dice, cpl
M = latm, dlnd, docn, ice (mixed layer ocean mode), cpl
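The configuration letters amount to a lookup from letter to the setup used for each of the five components. A minimal Python sketch of that lookup, assuming only the table on this slide; the names `CONFIGURATIONS`, `DATA_MODELS`, and `active_components` are illustrative and not part of the CCSM scripts:

```python
# Configuration letter -> setups for (atm, lnd, ocn, ice, cpl), copied from the slide.
# Ice mode qualifiers (prescribed, mixed layer ocean) are left out of this sketch.
CONFIGURATIONS = {
    "A": ("datm", "dlnd", "docn", "dice", "cpl"),
    "B": ("atm", "lnd", "ocn", "ice", "cpl"),
    "C": ("datm", "dlnd", "ocn", "dice", "cpl"),
    "D": ("datm", "dlnd", "docn", "ice", "cpl"),
    "F": ("atm", "lnd", "docn", "ice", "cpl"),
    "G": ("latm", "dlnd", "ocn", "ice", "cpl"),
    "H": ("atm", "dlnd", "docn", "dice", "cpl"),
    "I": ("datm", "lnd", "docn", "dice", "cpl"),
    "K": ("atm", "lnd", "docn", "dice", "cpl"),
    "M": ("latm", "dlnd", "docn", "ice", "cpl"),
}

# The data models named on the component parallelization slide.
DATA_MODELS = {"datm", "dlnd", "docn", "dice", "latm"}


def active_components(letter):
    """Return the full (non-data) models in a configuration, excluding the coupler."""
    return [s for s in CONFIGURATIONS[letter] if s not in DATA_MODELS and s != "cpl"]
```

For example, `active_components("G")` returns `["ocn", "ice"]`, which shows that G couples the active ocean and ice models to data atmosphere and land.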
6
Resolutions
atm/lnd/datm/dlnd = T42, T31
ocn/ice/docn/dice = gx1v3, gx3, gx3v4
latm = T62
Scientifically validated combinations
  –B, T42_gx1v3 = b20.007 control run (test.a1 case)
  –B, T31_gx3v4 = paleo control run (test.a2 case)
7
“Available” configurations
Configurations : A B C D F G H I K M
T42_gx1v3 : ********
T31_gx3   : *******
T31_gx3v4 : *
T62_gx1v3 : **
T62_gx3   : **
Legend : * = supported (subject to change), * = b20.007 control, * = paleo control
8
Platforms
IBM
SGI
Compaq*
9
Review of scripts
Main script (test.a1.run)
  –Sets primary ccsm environment variables
  –Calls $model.setup.csh
      Gets input datasets
      Builds components
  –Runs model
  –Archives
  –Harvests
10
Setting up a simple case
Use the GUI!
  –The GUI modifies the scripts and creates a new case for you
  –Input $CASE, $CSMROOT, $CSMDATA, $EXEROOT
  –Input resolution
  –Input configuration (A-M)
  –Sets processor layout based on configuration (first guess)
  –Sets some batch environment variables
  –Works well in the NCAR environment; other sites require tuning after the scripts are generated
11
Setting up a simple case, without GUI
Create new case directory under scripts, copy over test.a1 files
Rename file test.a1.run to $CASE.run
  –Edit $CASE, $CSMROOT, $CSMDATA, $EXEROOT, $ARCROOT
  –Edit batch environment parameters
  –Edit $GRID
  –Edit $SETUPS
  –Edit $NTASKS, $NTHRDS
12
$NTASKS, $NTHRDS, batch
$NTASKS is the total number of MPI tasks for each component
$NTHRDS is the number of OpenMP threads per MPI task
$NTASKS*$NTHRDS = total number of processors for each component
Tuning is required to get optimal load balance
Batch parameters should match the processors used; consistency is important
task_geometry (LoadLeveler) is very powerful
13
Component parallelization
atm : MPI, OpenMP, or MPI/OpenMP
lnd : MPI, OpenMP, or MPI/OpenMP
ice : MPI only, NTHRDS=1
ocn : MPI only, NTHRDS=1
cpl : OpenMP only, NTASKS=1
Data models (datm, docn, dice, dlnd, latm) : serial only, 1 processor, NTASKS=1, NTHRDS=1
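These constraints can be checked mechanically before a job is submitted. The following Python sketch is one way to do that, assuming the rules exactly as listed on this slide; the function `layout_errors` and the set names are hypothetical and do not exist in the CCSM scripts:

```python
# Parallelization rules taken from the slide.
MPI_ONLY = {"ice", "ocn"}                                # NTHRDS must be 1
OPENMP_ONLY = {"cpl"}                                    # NTASKS must be 1
SERIAL_ONLY = {"datm", "docn", "dice", "dlnd", "latm"}   # NTASKS = NTHRDS = 1


def layout_errors(setups, ntasks, nthrds):
    """Return a list of messages for every setup whose task/thread counts break a rule."""
    errors = []
    for setup, nt, nh in zip(setups, ntasks, nthrds):
        if setup in SERIAL_ONLY and (nt != 1 or nh != 1):
            errors.append(f"{setup}: serial only, needs NTASKS=1 and NTHRDS=1")
        elif setup in MPI_ONLY and nh != 1:
            errors.append(f"{setup}: MPI only, needs NTHRDS=1")
        elif setup in OPENMP_ONLY and nt != 1:
            errors.append(f"{setup}: OpenMP only, needs NTASKS=1")
    return errors
```

An empty list means the layout is consistent with the rules; atm and lnd are unconstrained because they support MPI, OpenMP, and the hybrid mode.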
14
Main script configuration summary
B case
  MODELS (  atm   lnd   ocn   ice   cpl )
  SETUPS (  atm   lnd   ocn   ice   cpl )
  NTASKS (    8     2    40     8     1 )
  NTHRDS (    4     4     1     1     4 )
datm/dlnd/ocn/ice case
  MODELS (  atm   lnd   ocn   ice   cpl )
  SETUPS ( datm  dlnd   ocn   ice   cpl )
  NTASKS (    1     1    64    16     1 )
  NTHRDS (    1     1     1     1     4 )
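Since each component uses $NTASKS*$NTHRDS processors, the totals for the two layouts above follow directly. A short Python sketch of the arithmetic; the function name `processors_per_component` is illustrative only:

```python
def processors_per_component(models, ntasks, nthrds):
    """Processors per component = MPI tasks * OpenMP threads per task."""
    return {m: t * h for m, t, h in zip(models, ntasks, nthrds)}


models = ["atm", "lnd", "ocn", "ice", "cpl"]

# B case from the slide
b_case = processors_per_component(models, [8, 2, 40, 8, 1], [4, 4, 1, 1, 4])

# datm/dlnd/ocn/ice case from the slide
data_case = processors_per_component(models, [1, 1, 64, 16, 1], [1, 1, 1, 1, 4])

print(b_case, sum(b_case.values()))
print(data_case, sum(data_case.values()))
```

The B case uses 32 + 8 + 40 + 8 + 4 = 92 processors and the datm/dlnd/ocn/ice case uses 1 + 1 + 64 + 16 + 4 = 86; the batch request should match these totals.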
15
$RUNTYPE
Startup : initial startup of model using arbitrary initialization
  –set $CASE, $BASEDATE
Continue : continuation of case, bit-for-bit guaranteed, uses model restart files
  –set $CASE
Branch : start new case as a bit-for-bit continuation of another case, uses model restart files, requires continuous date
  –set $CASE, $REFCASE, $REFDATE
Hybrid : start new case, not a bit-for-bit continuation, uses model initial files in atm and land, can change starting date
  –set $CASE, $BASEDATE, $REFCASE, $REFDATE
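The variables each run type needs can be expressed as a small table and checked before a run. This is a Python sketch under the assumption that the run type is spelled in lowercase; the slide does not give the exact strings the main script expects, and `missing_vars` is a hypothetical helper:

```python
# Variables each $RUNTYPE requires, as listed on the slide.
# The lowercase keys are an assumption made for this sketch.
REQUIRED_VARS = {
    "startup": ["CASE", "BASEDATE"],
    "continue": ["CASE"],
    "branch": ["CASE", "REFCASE", "REFDATE"],
    "hybrid": ["CASE", "BASEDATE", "REFCASE", "REFDATE"],
}


def missing_vars(runtype, env):
    """Return the required variables that are unset or empty in the mapping env."""
    return [name for name in REQUIRED_VARS[runtype] if not env.get(name)]
```

Passing `os.environ` as `env` would check the current shell environment; a plain dictionary works equally well for testing.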
16
Coupler namelist
stop_option : ndays, nmonths, newmonth, halfyear, newyear, newdecade
stop_n : integer (ndays, nmonths)
rest_freq : ndays, monthly, quarterly, halfyear, yearly
rest_n : integer (ndays)
diag_freq : daily, weekly, biweekly, monthly, quarterly, yearly, ndays
diag_n : integer (ndays)
info_bcheck : integer
17
Data Sets
Types
  –Grid files, binary
  –Namelist input, ascii
  –Initial datasets, binary/netcdf
  –Restart datasets, binary
  –History datasets, netcdf
  –Log files, ascii
inputdata directory
  –This is usually pointed to by $CSMDATA
18
Data Flow, Input
Everything is copied to $EXEROOT
Tools and scripts attempt to automate most of the “get input files” work
Main script variables include $CSMDATA, $LFSINP, $LMSINP, $MACINP, $RFSINP, $RMSINP
[Diagram labels: $EXEROOT, Mass Store, $ARCROOT/restart, $CSMDATA = inputdata, scripts/$CASE, Setup scripts]
19
Data Flow, Output
Output files are moved out of $EXEROOT
Harvesting is a separate process
Writing of restart files is coordinated by the coupler
Writing of history files is not coordinated between components; monthly average is the default
Main script variables include $LMSOUT, $MACOUT, $RFSOUT
[Diagram labels: $EXEROOT, Mass Store, $ARCROOT, Scripts, archiving, harvesting]
20
Log Files
Each component produces a log file, $model.log.$LID
$LID is a system date stamp
Date stamps are the same on all log files for a run
Log files are written into the $EXEROOT/$model directories during execution
Log files are copied to $SCRIPTS/logs at the end of a run
There are separate stdout and stderr files that sometimes contain output information
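Because every log file from one run shares the same $LID, log names can be grouped by run with simple string handling. A Python sketch that relies only on the `$model.log.$LID` naming pattern; the date stamp used in the usage note is an invented example, since the slide does not specify the $LID format:

```python
def logs_by_run(filenames):
    """Group log file names of the form <model>.log.<LID> by their LID.

    Names that do not contain ".log." are ignored.
    """
    runs = {}
    for name in filenames:
        model, sep, lid = name.partition(".log.")
        if sep and model and lid:
            runs.setdefault(lid, []).append(model)
    return runs
```

For instance, `logs_by_run(["atm.log.020101-120000", "cpl.log.020101-120000"])` returns a single entry keyed by that stamp with the models `["atm", "cpl"]`. A component missing from a run's entry is a quick hint that it never wrote a log.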
21
Archiving, ccsm_archive
Archiving means moving model output to a separate area on a local disk, using ccsm_archive
Local disk area is set by $ARCROOT in the main script
Benefits
  –Allows separation of running and harvesting
  –Mass storage availability does not prevent continued execution of the model
  –Allows users to run in volatile temporary space
  –Supports simple harvesting in a clustered machine environment (like nirvana)
22
Harvesting, $CASE.har
Harvesting means copying model output to the local mass store
Separate script in scripts/$CASE, $CASE.har
Typically submitted in batch, can also be run interactively
Submitted by main script after model run, off by default
Sources ccsm_joe for important environment variables
Harvests all files in $ARCROOT/{atm,lnd,ocn,ice,cpl}
Verifies an accurate copy on the mass store before removing the local file
Can scp files to remote machines
23
Exact Restart
CCSM can stop and restart exactly
The coupler controls the frequency of restart file writes
Restart files guarantee bit-for-bit continuity at a checkpoint boundary
rpointer files are updated in the scripts/$CASE directory after each run
24
Restart file management (1)
ccsm_archive
  –In scripts/$CASE
  –Called from main script after model run is complete, commented out by default
  –$ARCROOT/restart contains the latest full set of restart files
  –ccsm_archive copies the full set of restart datasets into $ARCROOT/restart after each run
  –ccsm_archive then tars up that restart set into the $ARCROOT/restart.tars directory
  –These tar files can be large; regular cleanup is required
25
Restart file management (2)
ccsm_getrestart
  –In scripts/tools
  –Called from main script before model run starts, commented out by default
  –Copies the latest set of restart files from $ARCROOT/restart to the appropriate directories
To “back up” a model run to a previous model date
  –Assumes both ccsm_archive and ccsm_getrestart have been active in the main script
  –Delete all files in $ARCROOT/restart
  –Untar an $ARCROOT/restart.tars file into $ARCROOT/restart
  –Resubmit
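The delete-and-untar steps of the back-up procedure can be sketched in Python as follows. This assumes that each tar file in `restart.tars` stores paths relative to the restart directory, which the slide does not state, and `restore_restart_set` is a hypothetical helper rather than a CCSM tool:

```python
import shutil
import tarfile
from pathlib import Path


def restore_restart_set(arcroot, tar_name):
    """Replace the contents of <arcroot>/restart with one archived restart set."""
    restart_dir = Path(arcroot) / "restart"
    tar_path = Path(arcroot) / "restart.tars" / tar_name

    # Confirm the archive exists before deleting anything.
    if not tar_path.is_file():
        raise FileNotFoundError(tar_path)

    # Step 1: delete all files currently in the restart directory.
    for entry in restart_dir.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()

    # Step 2: untar the chosen restart set into the restart directory.
    with tarfile.open(tar_path) as tar:
        tar.extractall(restart_dir)
```

After this, resubmitting the main script with ccsm_getrestart active would pick up the restored set. Checking for the tar file first avoids emptying the restart directory when the archive name is mistyped.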
26
Auto-Resubmit
RESUBMIT file in scripts/$CASE directory
  –Contains a single integer
  –If the integer is >0, the main script resubmits itself and decrements the integer
Runaway jobs
  –FIRST, set the value in the RESUBMIT file to 0
  –Then attempt to kill the running jobs
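The counter logic is small enough to show in full. A Python sketch of the behaviour described above; the real main script is a csh script, so this only illustrates the logic, and `should_resubmit` is a hypothetical name:

```python
from pathlib import Path


def should_resubmit(resubmit_file):
    """Return True and decrement the counter if the RESUBMIT value is > 0."""
    path = Path(resubmit_file)
    count = int(path.read_text().strip())
    if count > 0:
        path.write_text(f"{count - 1}\n")
        return True
    return False
```

A file containing 2 therefore allows two further submissions and then stops. This is also why a runaway job is stopped by writing 0 to the file first: any job that finishes afterwards reads 0 and does not resubmit.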
27
Production
Modify the coupler namelist in cpl.setup.csh: set run length and restart frequency, turn down diagnostic frequency, set info_bcheck to 0
Run a startup, hybrid, or branch case ($RUNTYPE)
Transition to continue $RUNTYPE
Turn on archiving, harvesting, and ccsm_getrestart
Edit RESUBMIT file to initiate auto-resubmission
28
Monitoring a run
Monitor the batch jobs using llq, bjobs, qstat
Verify that runs complete successfully; check for timing information at the end of a log file
tail -f $EXEROOT/cpl/cpl.log*
If runs are not succeeding,
  –tail each log file
  –grep for ENDRUN in atm and lnd log files
  –Check stdout and stderr files for component messages or system messages
  –Look for core files in $EXEROOT/$model
  –Look for zero length files in $EXEROOT/$model
  –Check email
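Several of these checks can be combined into one pass over the $EXEROOT/$model directories. The Python sketch below assumes that core files have names starting with `core` and that log files contain `.log.` in their names; `diagnose` is a hypothetical helper, not one of the CCSM tools:

```python
from pathlib import Path


def diagnose(exeroot, models=("atm", "lnd", "ocn", "ice", "cpl")):
    """Report zero-length files, core files, and ENDRUN messages under exeroot."""
    findings = []
    for model in models:
        model_dir = Path(exeroot) / model
        if not model_dir.is_dir():
            continue
        for path in sorted(model_dir.iterdir()):
            if not path.is_file():
                continue
            if path.stat().st_size == 0:
                findings.append(f"zero-length file: {path}")
            elif path.name.startswith("core"):
                findings.append(f"core file: {path}")
            elif model in ("atm", "lnd") and ".log." in path.name:
                # ENDRUN is only searched for in atm and lnd logs, as on the slide.
                if "ENDRUN" in path.read_text(errors="replace"):
                    findings.append(f"ENDRUN in {path}")
    return findings
```

An empty result does not prove the run succeeded; the stdout and stderr files and the timing information at the end of the coupler log still need a look.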
29
Modifying source code
Modifying files in the ccsm models directory is not recommended
Create directories under scripts/$CASE
  –src.atm, src.lnd, src.ocn, src.ice, src.cpl
  –Copy a subset of model source code to these directories and modify it
  –These directories have the highest priority in the build
Benefits include
  –Release source code remains unmodified and available
  –Allows implementation of case dependent code modifications
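"Highest priority" means the build looks in the case's src.$model directory before the release source. The Python sketch below illustrates that search order only; the actual build is driven by the setup scripts and makefiles, and `resolve_source` and its arguments are hypothetical:

```python
from pathlib import Path


def resolve_source(filename, case_dir, model, release_dirs):
    """Return the first match for filename, searching the case override first.

    case_dir     : the scripts/$CASE directory
    model        : component name such as "atm" (selects src.<model>)
    release_dirs : release source directories, searched in the order given
    """
    candidates = [Path(case_dir) / f"src.{model}"]
    candidates += [Path(d) for d in release_dirs]
    for directory in candidates:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None
```

A file copied into src.atm shadows the release copy of the same name, while every file that was not copied still resolves to the untouched release source.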
30
Multiple Machine Support
Should run on blackforest, babyblue, and ute “out of the box”
“Other” machines include seaborg, nirvana, eagle, falcon, cheetah
Supported platforms are indicated in the $OS, $SITE, $MACH, $ARCH environment variables in the main script
See also scripts/tools/test.a1.mods.$MACH for suggested changes to test.a1.run for “other” machines
31
Running on a “New” Machine
Main script
  –Set batch queue commands
  –Add new $OS, $SITE, $MACH, $ARCH options
  –Set standard CCSM path names, $CSMROOT, …
  –Harvester submission issues
  –Set data movement variables, $LMSINP, …
Harvester script
  –May require modification
Tools
  –May need to modify ccsm_msread, ccsm_mswrite
Build
  –Modify models/bld/Macros.$OS file
32
ccsm_joe
Created by main script
Updated every time the main script runs
Case dependent
Records important ccsm environment variables
Can be “sourced” by other scripts to inherit ccsm environment variables
33
Interactive/Batch Issues
Can run main script interactively
Typically used to build and pre-stage initial data
Uncomment the “exit” command in the main script to stop it before ccsm execution starts
Batch environment is highly site dependent
  –NQS
  –LoadLeveler
  –LSF
  –PBS
34
Common Errors (1)
Model won’t build
  –Try rebuilding clean
  –Remove all obj directories; these are $OBJROOT/model/obj, which is normally equivalent to $EXEROOT/model/obj
  –When rebuilding, make sure $SETBLD is true in main script
Model won’t continue due to restart problem
  –Determine cause of problem: quota, hardware, script, zero length files, rpointer problems
  –Fix if possible
  –Back up to latest “good” restart dataset
  –Rerun
35
Common Errors (2)
Ice model stops due to mp transport error
  –Double ndte in ice.setup.csh ice model namelist
  –Back up to latest “good” restart dataset
  –Run past previous stop date
  –Reset ndte value
Ocean model non-convergence
  –Add about 10% to the number of model timesteps/hour in ocn.setup.csh, DT_COUNT
  –Back up to latest “good” restart dataset
  –Run past previous stop date
  –Reset DT_COUNT
  –Non-convergence on first timestep is special case
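Both workarounds are temporary parameter changes, so it helps to compute the adjusted value and keep the original for the reset step. A Python sketch of the arithmetic, assuming that "about 10%" is rounded up to a whole timestep; that rounding choice and the function names are assumptions, not CCSM conventions:

```python
def doubled_ndte(ndte):
    """Temporary ice model ndte used to get past an mp transport error."""
    return 2 * ndte


def increased_dt_count(dt_count):
    """Temporary ocean DT_COUNT: the original plus 10%, rounded up."""
    ten_percent = -(-dt_count // 10)  # integer ceiling of dt_count / 10
    return dt_count + ten_percent
```

For a DT_COUNT of 23 this gives 26, and for an ndte of 120 it gives 240. Once the run is past the previous stop date, both parameters go back to their original values.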
36
Tools
Under scripts/tools
  –ccsm_getfile : hierarchical search for file
  –ccsm_getinput : hierarchical search for input file
  –ccsm_msread : copies a file from local mass store
  –ccsm_mswrite : copies a file to local mass store
  –ccsm_checkenvs : echoes ccsm environment variables, used to create ccsm_joe
  –ccsm_getrestart : copies restart files from $ARCROOT/restart to appropriate $EXEROOT and scripts/$CASE directories
37
Performance
This is complicated!
Issues
  –Performance of components and system as a function of resolution and configuration
  –Scalability of individual components, scaling efficiency of individual components
  –Task/thread counts
  –Components sharing nodes, overloading nodes with multiple components, overloading threads, overloading tasks
  –Load balance of coupled system
38
Component Timings
39
CCSM Load Balancing
Processors : 40 ocean, 32 atm, 16 ice, 12 land, 4 cpl, 104 total
[Figure: load balance diagram of component timings, in seconds per day]
40
Component/Hardware layout
Machine : set of nodes
Node : group of processors that share memory
Processor : individual computing element
General rules
  –Do not oversubscribe processors; place only 1 MPI task or 1 thread on each processor
  –Minimize the number of nodes used for a given component and processor requirement
  –Multiple components can share a node as long as there is no oversubscription of processors
  –Test several decompositions, layouts, and task/thread combinations to try to optimize performance
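The no-oversubscription rule can be checked for a proposed placement by summing tasks times threads on each node. A Python sketch, assuming a uniform number of processors per node; the placement structure and the name `oversubscribed_nodes` are invented for illustration:

```python
def oversubscribed_nodes(placement, procs_per_node):
    """Return {node: processors requested} for nodes that exceed procs_per_node.

    placement maps a node name to a list of (component, ntasks, nthrds)
    tuples describing what runs on that node.
    """
    over = {}
    for node, parts in placement.items():
        used = sum(ntasks * nthrds for _, ntasks, nthrds in parts)
        if used > procs_per_node:
            over[node] = used
    return over
```

For example, on 4-processor nodes a node holding 2 ice tasks and the coupler with 4 threads requests 6 processors and is reported, while a node holding a single atm task with 4 threads is not.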
41
Summary
CCSM is a complicated multi-executable climate model; expect there to be “spin-up” time
CCSM is a scientific research code
There are many possible components, configurations, platforms, and resolutions; we are unable to test everything
Users are responsible for validating their science
NCAR can help with software/configuration problems: ccsm@ucar.edu
Please report bugs, fixes, improvements, and ports to new hardware, so we can incorporate those changes: ccsm@ucar.edu