Restarting a Run

How to continue a run.


1. Restart files

Each model component writes restart files at intervals set by the driver through the env_run.xml variables $REST_OPTION and $REST_N. By default, these two variables take the same values as $STOP_OPTION and $STOP_N, so restart files are written at the end of the run. In most cases, there is no need to modify them.

Restart files allow the model to stop and then start again with bit-for-bit exact capability (i.e. the model output is exactly the same as if it had never been stopped). The driver coordinates the writing of restart files as well as the time evolution of the model. All components receive restart and stop information from the driver and write restarts or stop as specified by the driver.

Whenever a component writes a restart file, it also writes a restart pointer file of the form rpointer.$component (e.g. rpointer.atm). The restart pointer file contains the name of the restart file that the component just wrote. Upon a restart, each component reads its restart pointer file to determine the filename(s) to read in order to continue the model run. For example, a component set using fully active model components creates the following pointer files:

  • rpointer.atm

  • rpointer.drv

  • rpointer.ice

  • rpointer.lnd

  • rpointer.rof

  • rpointer.cism

  • rpointer.ocn.ovf

  • rpointer.ocn.restart
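
As a minimal sketch of how to inspect these pointer files, the shell function below prints every rpointer.* file in a directory, followed by its contents. The function name show_rpointers is hypothetical (it is not a CESM tool), and the directory argument is assumed to be something like $RUNDIR.

```shell
# Hypothetical helper (not part of CESM): print each restart pointer file
# in a directory together with the restart filename(s) it contains.
show_rpointers() {
    dir="${1:-.}"                    # directory to inspect, e.g. "$RUNDIR"
    for f in "$dir"/rpointer.*; do
        [ -f "$f" ] || continue      # nothing to show if no pointer files match
        echo "== $(basename "$f")"
        cat "$f"
    done
}
```

It could be called as show_rpointers "$RUNDIR" once a run has written its restart files.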

Tips!
  1. Try using xmlquery to check the values of REST_OPTION and REST_N. What do you find?

  2. Take a look at the restart files and restart pointer files in your archive directory ($DOUT_S_ROOT/rest/yyyy-mm-dd-ssss/) or run directory ($RUNDIR). What do they look like?
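
For the first tip, a small sketch of the query is shown below. It assumes you are in the case directory, where the xmlquery script lives; the wrapper name show_restart_settings is hypothetical and only groups the call for convenience.

```shell
# Hypothetical wrapper; call it from the case directory.
show_restart_settings() {
    # A comma-separated list queries several variables in one call.
    # STOP_OPTION and STOP_N are included so the defaults can be compared.
    ./xmlquery REST_OPTION,REST_N,STOP_OPTION,STOP_N
}
```

Running show_restart_settings should list the restart and stop settings side by side, which makes it easy to see whether they match.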



2. Continue a run

Recall that the flag variable $CONTINUE_RUN controls whether a model run is initialized (FALSE) or continued from restart files (TRUE).

In the case of our 1-month test run, we submitted our initial job with CONTINUE_RUN = FALSE (because the run was just being initialized) and with RUN_TYPE set to startup, branch or hybrid. If the run has finished and everything looks good, and we want to continue the run for another month, what do we do?

We need to use xmlchange to set CONTINUE_RUN to TRUE and then submit the run again. The model will use the restart files to continue our run with a bit-for-bit match, as if it had never been stopped.
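
The steps above can be sketched as follows, assuming the commands are run from the case directory. The wrapper name continue_run is hypothetical; xmlchange, xmlquery and case.submit are the case scripts themselves.

```shell
# Hypothetical wrapper; call it from the case directory.
continue_run() {
    ./xmlchange CONTINUE_RUN=TRUE || return 1   # continue from the restart files
    ./xmlquery CONTINUE_RUN || return 1         # confirm the new value
    ./case.submit                               # submit the run again
}
```

Each step stops the sequence if it fails, so the job is not submitted with an unchanged CONTINUE_RUN value.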


Evaluate your understanding

If we do not set CONTINUE_RUN to TRUE and leave it as FALSE, what happens when we submit the run again?

Solution

The model will run the previous month once again instead of carrying on to the next month!

The $CONTINUE_RUN flag will be automatically set to TRUE when the variable RESUBMIT>0. Learn more in the chapter Changing Run Length.



3. Backing up to a previous restart

If a run encounters problems and crashes, it is extremely useful to back up to a previous restart.

Find the latest $DOUT_S_ROOT/rest/yyyy-mm-dd-ssss/ directory that was created and copy its contents into your run directory ($RUNDIR). When you then continue the run, these restart files will be used.

It is important to make sure the new rpointer.* files overwrite the previous rpointer.* files that were in $RUNDIR, or the job may not restart in the correct place.
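
A minimal sketch of this copy step follows. The helper name restore_restart and its two arguments are hypothetical; it assumes the chosen restart directory contains only regular files (restart files and rpointer.* files).

```shell
# Hypothetical helper: copy one archived restart set into the run directory.
#   $1 = restart directory (a dated directory under $DOUT_S_ROOT/rest/)
#   $2 = run directory ($RUNDIR)
restore_restart() {
    rest_dir="$1"
    run_dir="$2"
    [ -d "$rest_dir" ] && [ -d "$run_dir" ] || {
        echo "restore_restart: missing directory" >&2
        return 1
    }
    # cp replaces files of the same name, so the rpointer.* files already
    # in the run directory are overwritten by the archived ones.
    cp -f "$rest_dir"/* "$run_dir"/
}
```

After the copy, the rpointer.* files in the run directory name the archived restart files, and the run can be continued from that point.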

Occasionally, when a run has problems restarting, it is because the rpointer files are out of sync with the restart files. The rpointer files are text files and can easily be edited to match the correct dates of the restart and history files. All the restart files should have the same date.
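
One way to check whether the pointer files agree is to list the distinct date stamps they mention. The sketch below is hypothetical (rpointer_dates is not a CESM tool) and assumes that restart filenames embed a stamp of the form 0001-02-01-00000 (four-digit year, month, day, five-digit seconds); adjust the pattern if your filenames differ.

```shell
# Hypothetical helper: print the distinct date stamps found in rpointer.* files.
# Assumed stamp format: yyyy-mm-dd-sssss, e.g. 0001-02-01-00000.
rpointer_dates() {
    dir="${1:-.}"                    # directory to inspect, e.g. "$RUNDIR"
    cat "$dir"/rpointer.* 2>/dev/null \
        | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{5}' \
        | sort -u
}
```

If rpointer_dates "$RUNDIR" prints a single line, all pointer files refer to the same date; more than one line suggests that some pointer files are out of sync and need editing.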