Diagnose a runtime failure

Exercise: Add an additional output variable

In this exercise you will add an extra CAM history file (an h1 file) to a case, let the run fail, and then work out why it crashed.

Create a case named b1850_debugging using:

  • Compset: B1850C_LTso

  • Resolution: ne16pg3_t201

  • Run length: 1 month

  • Additional create_newcase flag: --run-unsupported

Next, modify the CAM history output by adding an additional history stream that writes daily averages of the variable T2M, with one output file per day.

Set up, build, and submit the case.

Important: The model is expected to crash. This is intentional. Your task is to determine why the run failed.

Hints

Tip for adding an h1 file

For more information about how to add an h1 file, check the section about namelist modifications.

If you don’t have time to check that section right now, the following adds an h1 file containing daily averages of T2M, with one file per day. Add these lines to user_nl_cam:

 fincl2 = 'T2M:A'
 nhtfrq = 0,-24
 mfilt = 1,1

In these lines, fincl2 lists the fields written to the second history file (h1), and the :A suffix requests time averaging. nhtfrq sets the output frequency of each history file (0 means monthly, -24 means every 24 hours), and mfilt sets the number of time samples per file.

Tips for troubleshooting

Check the Derecho queue and wait until your run no longer appears in it.

Once your run has left the queue:

  • Go to the archive directory: can you see the history files in the archive directory? The answer should be no. Why?

  • Go to the run directory: Is there any evidence of history files or restart files being created by the run? The answer, again, should be no. This indicates that the model failed very early during initialization, before any model timesteps were completed.

Finally, inspect the log files in RUNDIR to determine what caused the crash.
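
The checks above can be sketched as a small shell helper. This is only an illustration: the function name count_cam_history is hypothetical, and the file pattern assumes CAM history files are named like <case>.cam.h*.nc. The commented lines show how it might be used from the case directory, where xmlquery reports the run and archive locations.

```shell
# Hypothetical helper: count CAM history files under a directory.
# Assumes history files are named like <case>.cam.h*.nc.
count_cam_history() {
    find "$1" -name '*.cam.h*.nc' 2>/dev/null | wc -l | tr -d ' '
}

# Typical use from the case directory (needs a CESM case, so left commented):
#   qstat -u "$USER"                                     # is the job still queued?
#   count_cam_history "$(./xmlquery DOUT_S_ROOT --value)"  # archive directory
#   count_cam_history "$(./xmlquery RUNDIR --value)"       # run directory
```

A count of 0 in both directories is consistent with a failure during initialization.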

Solution

# Create a new case

Create a new case b1850_debugging with the command:

cd /glade/u/home/$USER/code/my_cesm_code/cime/scripts

./create_newcase \
    --case ~/cases/b1850_debugging \
    --compset B1850C_LTso \
    --res ne16pg3_t201 \
    --run-unsupported

# Setup

Invoke case.setup with the command:

 cd ~/cases/b1850_debugging
 ./case.setup

# Customize namelists

Add the daily output of T2M by editing the file user_nl_cam and adding the lines:

 fincl2 = 'T2M:A'
 nhtfrq = 0,-24
 mfilt = 1,1

# Set run length

Set the run length to 1 month:

./xmlchange STOP_N=1,STOP_OPTION=nmonths

# Change the job queue and account number

If needed, change the job queue and account number.
For instance, to run in the tutorial queue under project number UESM0016 (use the project number you were given for this tutorial), use the command:

./xmlchange JOB_QUEUE=tutorial,PROJECT=UESM0016 --force

# Build and submit

Build the model and submit your job:

qcmd -- ./case.build
./case.submit

# Investigating the failure

Your run should crash almost immediately. This is expected: the goal of the exercise is to practice troubleshooting.

Because the model crashed before completing initialization, you should not find any CAM history or restart files.

Instead, examine the log files in RUNDIR.

Your run directory should contain CESM log files similar to these:

atm.log.6612071.desched1.260701-115945
cesm.log.6612071.desched1.260701-115945
diags.log.6612071.desched1.260701-115945
drv.log.6612071.desched1.260701-115945
glc.log.6612071.desched1.260701-115945
ice.log.6612071.desched1.260701-115945
lnd.log.6612071.desched1.260701-115945
logfile.000000.out
med.log.6612071.desched1.260701-115945
mpibind.6612071.log
ocn.log.6612071.desched1.260701-115945
rof.log.6612071.desched1.260701-115945
wav_in.log
wav.log.6612071.desched1.260701-115945

You will also see many files named:

PET*.ESMF_LogFile

These are ESMF log files written by individual MPI tasks. They are usually not the first place to look when debugging a typical CESM runtime failure.

Somewhere in the CESM log files is information about what has gone wrong, but it is often not entirely straightforward to find.

  • Often the errors at the bottom of a log file are not relevant to your problem: they only show that individual processes are exiting.

  • Often the relevant error lies above this and can sometimes be found by searching for the first occurrence of ERROR or ABORT or cesm.exe.
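
One way to run that search is with grep. The sketch below is a minimal example, not an official CESM tool: the function name first_error is hypothetical, and it assumes the messages of interest contain ERROR or ABORT in upper case.

```shell
# Hypothetical helper: for each log file given, print the first line that
# mentions ERROR or ABORT, prefixed with the file name and line number.
first_error() {
    grep -H -n -m 1 -E 'ERROR|ABORT' "$@"
}

# Typical use from the run directory (needs real log files, so left commented):
#   first_error cesm.log.* atm.log.*
```

Starting from the reported line number, read a few lines above it as well, since the line that explains the failure often comes just before the first ERROR.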

In this case, the end of atm.log.* gives us the relevant information. Look at the very end of that file and you should see:

FLDLST: T2M in fincl(1, 2) not found
ERROR: FLDLST: 1 errors found, see log

This error message tells us that T2M is not a valid CAM history field, and requesting it in fincl2 caused CESM to abort during initialization. The correct CAM variable for near-surface temperature is TREFHT.
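
To correct the problem, replace T2M with TREFHT in user_nl_cam and resubmit. The sketch below shows one possible way to do that from the shell; the function name fix_history_field is hypothetical, and editing the file by hand works just as well. It assumes that a change to user_nl_cam is picked up at submission without rebuilding, which is normally the case for namelist-only changes.

```shell
# Hypothetical helper: replace the invalid field name T2M with TREFHT
# in a namelist file such as user_nl_cam.
fix_history_field() {
    # Write to a temporary file first so the edit works with any POSIX sed.
    sed 's/T2M/TREFHT/g' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Typical use from the case directory (needs a CESM case, so left commented):
#   fix_history_field user_nl_cam
#   ./case.submit
```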

What you learned

In this exercise you practiced a typical CESM debugging workflow:

  1. Verify that the run failed.

  2. Examine the log files in RUNDIR.

  3. Identify the first meaningful error message.

  4. Determine which component generated the error.

  5. Correct the underlying problem and rerun the model.

This is the same workflow you will use to diagnose many runtime errors in CESM.