Changing Run Length

How do you set the run length of a long simulation using the variables $STOP_OPTION, $STOP_N and $RESUBMIT?

1. Number of submissions and run length

Recall that we can use STOP_N and STOP_OPTION to control the run length of each batch job submission.

A typical long model simulation (say you want to run the model for 100 years) consists of many job submissions, because each batch job submission is limited in wallclock time. For example, on Cheyenne, the regular queue wallclock limit is 12 hours.

To complete the long run, specify the number of times the job should be resubmitted using the $RESUBMIT variable in env_run.xml.
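As a rough illustration, the total length of a run is the length of one submission multiplied by the total number of submissions (the initial one plus the resubmissions). The Python sketch below assumes every submission covers the same STOP_N units of STOP_OPTION; `total_run_length` is a hypothetical helper, not part of CESM.

```python
def total_run_length(stop_n: int, resubmit: int) -> int:
    """Hypothetical helper: total model time, in STOP_OPTION units,
    covered by the initial submission plus `resubmit` resubmissions,
    assuming each submission runs for `stop_n` units."""
    if stop_n <= 0 or resubmit < 0:
        raise ValueError("stop_n must be positive and resubmit non-negative")
    total_submissions = 1 + resubmit  # initial submission + resubmissions
    return stop_n * total_submissions


# Example: 10-year submissions resubmitted 4 times cover 50 model years.
print(total_run_length(stop_n=10, resubmit=4))  # 50
```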


Evaluate your understanding

The tutorial version of the CESM model on Cheyenne simulates ~10 model years per wallclock day. The maximum wallclock request is 12 hours. If you want to run the model for 100 years, what values should be set for STOP_OPTION, STOP_N and RESUBMIT?

Hint!
  • How should STOP_N and STOP_OPTION be set for each submission, given the wallclock limit?

  • How many times must the job be resubmitted to reach 100 years?

  • The total number of submissions = the initial submission + the number of resubmissions.

Solution

Assume we want to use the full 12 hours for each job submission.

The model runs 10 years per wallclock day (24 hours), so 12 hours of wallclock time gives us 5 years per job submission.

For a total of 100 years, we will need 100 / 5 = 20 submissions.

STOP_OPTION='nyears', STOP_N=5, RESUBMIT=19

so that the initial run of 5 years + (19 resubmissions x 5 years per job) = 100 years.
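The arithmetic in this solution can be sketched in Python. This is a minimal illustration, assuming constant model throughput and that each job may use the full wallclock limit; `plan_run` is a hypothetical helper, not part of CESM, and the resulting values would still need to be set in env_run.xml.

```python
import math


def plan_run(total_years: int, years_per_day: float, wallclock_hours: float) -> dict:
    """Hypothetical helper: choose STOP_N (in years) and RESUBMIT for a long run.

    Assumes throughput is constant and each job uses the full wallclock limit.
    """
    # Whole model years that fit in one job at the given throughput.
    years_per_job = math.floor(years_per_day * wallclock_hours / 24)
    if years_per_job < 1:
        raise ValueError("less than one model year fits in a single job")
    # Total submissions needed, rounded up so the full length is covered.
    submissions = math.ceil(total_years / years_per_job)
    return {
        "STOP_OPTION": "nyears",
        "STOP_N": years_per_job,
        "RESUBMIT": submissions - 1,  # the initial submission is not a resubmission
    }


print(plan_run(total_years=100, years_per_day=10, wallclock_hours=12))
# {'STOP_OPTION': 'nyears', 'STOP_N': 5, 'RESUBMIT': 19}
```

Rounding STOP_N down keeps each job within the wallclock limit; rounding the number of submissions up means the run may slightly overshoot the target length when it is not a multiple of STOP_N.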



2. RESUBMIT and CONTINUE_RUN

In the exercise above, the first submission is the initial run, for which CONTINUE_RUN is set to FALSE by default. To continue the run after the first 5 years, you need to tell the model to continue by setting CONTINUE_RUN=TRUE.

If you have set RESUBMIT>0, your script automatically sets CONTINUE_RUN=TRUE once the first submission completes, so that all subsequent submissions to the queue continue the run.

Evaluate your understanding

In the previous exercise, if we have set RESUBMIT=19 before the initial run, what are the values of CONTINUE_RUN and RESUBMIT at the time of:

  • the initial submission of 5 years

  • the next submission of 5 years

  • the 3rd run (2nd resubmission) of 5 years?

Solution
  • the initial submission of 5 years:

    CONTINUE_RUN=FALSE, RESUBMIT=19 
    

    because when the run is first initialized, CONTINUE_RUN=FALSE.

  • the next submission of 5 years:

    CONTINUE_RUN=TRUE, RESUBMIT=18 
    

    because RESUBMIT>0, CONTINUE_RUN automatically switches to TRUE after the initial run completes, and RESUBMIT decreases by 1.

  • the 3rd run (2nd resubmission) of 5 years:

    CONTINUE_RUN=TRUE, RESUBMIT=17
    

    because CONTINUE_RUN stays TRUE and RESUBMIT decreases by 1 with each resubmission.
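The full sequence of values in this solution can be reproduced with a short Python sketch. It assumes every job completes successfully and that each submission runs for the same STOP_N years; `resubmit_schedule` is a hypothetical helper that only models the behavior described above, it does not read or modify env_run.xml.

```python
from typing import Iterator, Tuple


def resubmit_schedule(resubmit: int, stop_n: int) -> Iterator[Tuple[int, bool, int, int, int]]:
    """Hypothetical helper: yield (submission, CONTINUE_RUN, RESUBMIT,
    first_year, last_year) for every job, assuming each job succeeds."""
    continue_run = False  # the initial run starts fresh
    remaining = resubmit
    for submission in range(1, resubmit + 2):  # 1 initial run + `resubmit` resubmissions
        first_year = (submission - 1) * stop_n + 1
        last_year = submission * stop_n
        yield submission, continue_run, remaining, first_year, last_year
        continue_run = True  # switched to TRUE after a job completes with RESUBMIT > 0
        remaining -= 1       # one fewer resubmission left for the next job


# Print the first three submissions and the final one for RESUBMIT=19, STOP_N=5.
for sub, cont, left, first, last in resubmit_schedule(resubmit=19, stop_n=5):
    if sub <= 3 or left == 0:
        print(f"submission {sub:2d}: CONTINUE_RUN={str(cont).upper()}, "
              f"RESUBMIT={left}, model years {first}-{last}")
```

The last submission runs with RESUBMIT=0, so no further job is queued after it completes.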