First Test Runs · 0.1° JRA BGC Run

Per the discussion in the model configuration thread, I've got a 1-day test run in the queue at cheyenne. It differs from our eventual production run in the following ways (probably not an all-inclusive list):

I'm running for a single day just to make sure everything actually works -- future timing runs will be O(1 month) once I have a starting point for estimating wallclock needs

Michael Levy (Apr 25 2020 at 21:07):

162: istep1, my_task, iblk =       17665         162           1
162: category n =            1
162: Global block:         430
162: Global i and j:        1047         198
162: Lat, Lon:  -70.1472839712437       -5.35000000000000
162: aice:  3.442529845610359E-002
162: n:            1 aicen:   2.803949195110218E-003
162: ice: Vertical thermo error
162: ERROR: ice: Vertical thermo error
162:Image              PC                Routine            Line        Source
162:cesm.exe           000000000167EA9A  Unknown               Unknown  Unknown
162:cesm.exe           0000000000D768F0  shr_abort_mod_mp_         114  shr_abort_mod.F90
162:cesm.exe           00000000005058C4  ice_exit_mp_abort          46  ice_exit.F90
162:cesm.exe           000000000070427E  ice_step_mod_mp_s         578  ice_step_mod.F90
162:cesm.exe           00000000005E1748  cice_runmod_mp_ci         172  CICE_RunMod.F90
162:cesm.exe           00000000004F7559  ice_comp_mct_mp_i         584  ice_comp_mct.F90
162:cesm.exe           0000000000428324  component_mod_mp_         737  component_mod.F90
162:cesm.exe           0000000000409D5C  cime_comp_mod_mp_        2603  cime_comp_mod.F90
162:cesm.exe           0000000000427F5C  MAIN__                    133  cime_driver.F90
162:cesm.exe           0000000000407D22  Unknown               Unknown  Unknown
162:libc.so.6          00002AD9A84DA6E5  __libc_start_main     Unknown  Unknown
162:cesm.exe           0000000000407C29  Unknown               Unknown  Unknown

Case root:

/glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/tests/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.20200424_143439_sexhyo

Rundir: /glade/scratch/mlevy/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.20200424_143439_sexhyo/run

I'm currently waiting on two SMS_Ld1.TL319_t13.GIAF_JRA_HR.cheyenne_intel.pop-performance_eval tests to run, one in the same sandbox and one in a CESM 2.1 sandbox. It doesn't look like there were any CICE mods when JRA was updated in 2.1, but if 2.1 runs successfully and the 2.2 sandbox doesn't then maybe it'll help us track down something in CIME or POP that didn't get merged in correctly.

@Keith Lindsay I was under the impression that nobody has run GIAF_JRA_HR in the latest 2.1, is that accurate? So it's possible that we just need to talk to @David Bailey about how to configure CICE correctly?

Keith Lindsay (Apr 25 2020 at 21:28):

The usual fix for ERROR: ice: Vertical thermo error is the following patch to
$SRCROOT/components/cice/src/source/ice_therm_vertical.F90,

--- /glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/components/cice/src/source/ice_therm_vertical.F90   2020-04-17 16:10:05.319640578 -0600
+++ ./ice_therm_vertical.F90    2020-04-25 15:20:33.546097802 -0600
@@ -2387,7 +2387,7 @@ subroutine conservation_check_vthermo(nx
                - fsnow(i,j)*Lfresh - fadvocn(i,j)) * dt
          ferr = abs(efinal(ij)-einit(ij)-einp) / dt

-         if (ferr > ferrmax) then
+         if (ferr > 1.1_dbl_kind*ferrmax) then
             l_stop = .true.
             istop = i
             jstop = j

This just loosens the tolerance on the energy conservation.
It doesn't change answers, except for allowing runs to proceed.
It looks like this was added to CESM_CICE5 master on Dec 23, 2019, and you're using a tag from Nov 6, 2019.

Keith Lindsay (Apr 25 2020 at 21:38):

Michael Levy (Apr 25 2020 at 21:42):

Isn't that the default when using leapfrog rather than Robert? (Though I suppose this means I won't actually be running POP in my test due to waiting a coupling interval before starting? That's a bummer.)

Michael Levy (Apr 25 2020 at 22:13):

So I'll test this CICE change with an SMS_Ld2.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval test so that POP gets a chance to take some time steps (it initialized fine in the SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval tests that caused CICE to crash); test is building an submitting now, I'll check in on it tomorrow or Monday and let everyone know how it goes

Keith Lindsay (Apr 25 2020 at 22:13):

When you run with leapfrog, you need to use at least 3 ocean timesteps (2 full+1 half) per coupling interval.
If you have OCN_NCPL=24, this means a timestep of 24 minutes or less.
One reason for not doing that with the gx1v[67] grids is that
it's a much shorter timestep than the 1 hr target timestep for that resolution.

But for tx0.1v[23] grid, the target timestep is a few minutes.
So it's not problematic to use higher OCN_NCPL.

Alper is using ATM_NCPL=144, OCN_NCPL=48 in /glade/scratch/altuntas/temp/g.e20.G.TL319_t13.control.001_hfreq.
I suggest asking him him to point us to a trusted hi-res JRA case.
There are other settings that we'll want to vet also,
but I'm focusing on NCPL settings initially because these seem likely to impact timing tests.

I think it's a great step forward that your test run completed ocean model, and MARBL, initialization.

Michael Levy (Apr 25 2020 at 22:17):

Thanks for pointing me to Alper's case root - I'll send him an email and see if he can point me to documentation of any changes that haven't made it into the compset definition yet

Michael Levy (Apr 26 2020 at 04:02):

Putting the 1.1 multiple in front of ferrmax just reduces the number of tasks reporting the error, it looks like the ice.log is getting to the exact same spot, but the first error in cesm.log is now

664: istep1, my_task, iblk =       17665         664           1
664: category n =            1
664: Global block:        1015
664: Global i and j:         539         401
664: Lat, Lon:  -61.3373858360807       -56.1500000000000
664: aice:  1.377508213909708E-003
664: n:            1 aicen:   4.334067408077150E-004
664: ice: Vertical thermo error
664: ERROR: ice: Vertical thermo error
664:Image              PC                Routine            Line        Source
664:cesm.exe           000000000167EA9A  Unknown               Unknown  Unknown
664:cesm.exe           0000000000D768F0  shr_abort_mod_mp_         114  shr_abort_mod.F90
664:cesm.exe           00000000005058C4  ice_exit_mp_abort          46  ice_exit.F90
664:cesm.exe           000000000070427E  ice_step_mod_mp_s         578  ice_step_mod.F90
664:cesm.exe           00000000005E1748  cice_runmod_mp_ci         172  CICE_RunMod.F90
664:cesm.exe           00000000004F7559  ice_comp_mct_mp_i         584  ice_comp_mct.F90
664:cesm.exe           0000000000428324  component_mod_mp_         737  component_mod.F90
664:cesm.exe           0000000000409D5C  cime_comp_mod_mp_        2603  cime_comp_mod.F90
664:cesm.exe           0000000000427F5C  MAIN__                    133  cime_driver.F90
664:cesm.exe           0000000000407D22  Unknown               Unknown  Unknown
664:libc.so.6          00002B2EE17746E5  __libc_start_main     Unknown  Unknown
664:cesm.exe           0000000000407C29  Unknown               Unknown  Unknown

and the ice_step_mod appears fewer times in cesm.log. However, I added CIME mods to change the coupling frequency to what Alper is using. That test (

/glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/tests/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.003

) is in the queue and if it runs before I go to bed I'll provide one more update otherwise I'll check on it on Monday.

Alper's run includes a few other modifications that I have NOT added to our run. These are just the ones that I noticed:
1. No ideal age tracer (probably negligible for us, since we're adding the 32 BGC tracers)
2. no CVMix (but turning tidal mixing on)
3. h_upper = 20.0 in user_nl_pop
4. ndtd = 2 in user_nl_cice

Is there any chance that changing ndtd from 3 (default value for the 0.1 degree grids) to 2 would fix this Vertical thermo error issue?

Michael Levy (Apr 26 2020 at 04:05):

Looks like my .003 test finished, but nothing beyond initialization in the ocean log:

$ tail -n 6 ocn.log.1953646.chadmin1.ib0.cheyenne.ucar.edu.200425-220050
========================================================================


 End of initialization

========================================================================

So I definitely need to get past the CICE error to figure out if we'll run into POP errors

David Bailey (Apr 27 2020 at 15:16):

This is more serious than a small energy conservation error. What is ICE_NCPL/ATM_NCPL? What is ndtd (xndt_dyn)? What version of CESM are you running here? How long into the run before this blows up? How are you initializing POP/CICE?

Michael Levy (Apr 27 2020 at 16:50):

@David Bailey This is cesm2_2_beta04; out of the box ICE_NCPL = ATM_NCPL = 24 and OCN_NCPL=1, but the error also occurs when I change to ICE_NCPL = ATM_NCPL = 144 and OCN_NCPL = 24. I think ndtd = 3, I was toying with the idea of setting it to 2 instead to match Alper's run but haven't done so yet. I'll double-check initialization, but I think it's a startup run; it's GIAF_JRA_HR compset

Michael Levy (Apr 27 2020 at 20:24):

@David Bailey oh, I noticed two other differences as with Alper's older high-res run: he has ktherm=1 and r_snw=1.00, while CICE currently defaults to ktherm=2 and r_snw=1.25for the high-res run. So I'm testing with his values of those three variables (ndtd, ktherm, r_snw)

David Bailey (Apr 27 2020 at 20:59):

Oh! He was using the CICE4 physics, but we are trying to use the CICE5 physics for the high res now. @Fred Castruccio has some runs with CICE5 physics on. I thought Alper had done some with the CICE5 physics too? What is the CICE initial file? It looks like it is crashing right away.

Michael Levy (Apr 27 2020 at 21:01):

ice_ic = "/glade/p/cesmdata/cseg/inputdata/ice/cice/g.e11.G.T62_t12.002.cice.r.0016-01-01-00000.nc"

Michael Levy (Apr 27 2020 at 22:00):

Off a conference call, so longer answer: Alper does have a run out of the CESM 2.1 sandbox. It's possible that something has gone wonky in the merge from CESM 2.1 -> 2.2 (the CIME merge had some issues that were resolved in a later PR, though something else could have slipped through the cracks), or it could be a difference in CICE in the two branches. Or, as you pointed out, it could be CICE4 vs CICE5. He sent me the script he used to set up the run, which includes

./xmlchange --id CICE_CONFIG_OPTS --val "-phys cice4 -trage 0"

and also some source mods for CICE from /glade/p_old/work/altuntas/cesm.sourcemods/cesm2_0_alpha08b.jra/src.cice/:

$ ls SourceMods/src.cice/
ice_import_export.F90  namelist_definition_cice.xml  README

So if @Fred Castruccio has CICE5 working more out-of-the-box, that would be great :)

Michael Levy (Apr 28 2020 at 03:33):

/glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/tests/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.006

was run with CICE4 and still aborts on what appears to be the first CICE time step:

546: istep1, my_task, iblk =       17665         546           1
546: category n =            1
546: Global block:         897
546: Global i and j:        3462         321
546: Lat, Lon:  -64.9490236870987       -123.850000000000
546: aice:  1.053948729398668E-003
546: n:            1 aicen:   3.055358416585096E-004
546: ice: Vertical thermo error
546: ERROR: ice: Vertical thermo error
546:Image              PC                Routine            Line        Source
546:cesm.exe           000000000167377A  Unknown               Unknown  Unknown
546:cesm.exe           0000000000D6B5D0  shr_abort_mod_mp_         114  shr_abort_mod.F90
546:cesm.exe           00000000005058C4  ice_exit_mp_abort          46  ice_exit.F90
546:cesm.exe           00000000006FB853  ice_step_mod_mp_s         578  ice_step_mod.F90
546:cesm.exe           00000000005DC028  cice_runmod_mp_ci         172  CICE_RunMod.F90
546:cesm.exe           00000000004F7559  ice_comp_mct_mp_i         584  ice_comp_mct.F90
546:cesm.exe           0000000000428324  component_mod_mp_         737  component_mod.F90
546:cesm.exe           0000000000409D5C  cime_comp_mod_mp_        2603  cime_comp_mod.F90
546:cesm.exe           0000000000427F5C  MAIN__                    133  cime_driver.F90
546:cesm.exe           0000000000407D22  Unknown               Unknown  Unknown
546:libc.so.6          00002AD921E736E5  __libc_start_main     Unknown  Unknown
546:cesm.exe           0000000000407C29  Unknown               Unknown  Unknown

Alper's source mods in ice_import_export.F90 prevent several fields from being negative:

140,141c140,141
<              Qa   (i,j,iblk)   = aflds(i,j, 6,iblk)
<              rhoa (i,j,iblk)   = aflds(i,j, 7,iblk)
---
>              Qa   (i,j,iblk)   = max(aflds(i,j, 6,iblk),c0)
>              rhoa (i,j,iblk)   = max(aflds(i,j, 7,iblk),c0)
143,146c143,146
<              swvdr(i,j,iblk)   = aflds(i,j, 9,iblk)
<              swidr(i,j,iblk)   = aflds(i,j,10,iblk)
<              swvdf(i,j,iblk)   = aflds(i,j,11,iblk)
<              swidf(i,j,iblk)   = aflds(i,j,12,iblk)
---
>              swvdr(i,j,iblk)   = max(aflds(i,j, 9,iblk),c0)
>              swidr(i,j,iblk)   = max(aflds(i,j,10,iblk),c0)
>              swvdf(i,j,iblk)   = max(aflds(i,j,11,iblk),c0)
>              swidf(i,j,iblk)   = max(aflds(i,j,12,iblk),c0)
148,149c148,149
<              frain(i,j,iblk)   = aflds(i,j,14,iblk)
<              fsnow(i,j,iblk)   = aflds(i,j,15,iblk)
---
>              frain(i,j,iblk)   = max(aflds(i,j,14,iblk),c0)
>              fsnow(i,j,iblk)   = max(aflds(i,j,15,iblk),c0)

(I snipped out some OMP statements since NTHRDS_ICE=1 and also some diagnostic write() statements).

Michael Levy (Apr 28 2020 at 03:57):

Adding the above to CICE's ice_import_export.F90 kept CICE running past the first time step (@David Bailey does this hint at a problem in either the forcing data or maps, or does it seem like a reasonable work-around?). Probably not a surprise, getting past the first time step just let us find a runtime error in POP:

1897: ecosys_driver:ecosys_driver_set_interior: NaN in dtracer_module, (i,j,k)=(
1897:        1497 ,        1339 ,           1 )
1897: (lon,lat)=(   39.6500000000000      ,   15.4607237580609      )
1897: PO4   0.211717655882239                            NaN
1897: NO3    1.27333082631230                            NaN
1897: SiO3    2.19131705164909                            NaN
1897: NH4   6.350598431893197E-002                       NaN
1897: Fe   2.437298209344740E-003                       NaN
1897: Lig   5.812391901820265E-004                       NaN
1897: O2    203.535365092392                            NaN
1897: DIC    2105.65655080915                            NaN
1897: DIC_ALT_CO2    2105.65655080915                            NaN
1897: ALK    2391.51854738421                            NaN
1897: ALK_ALT_CO2    2391.51854738421                            NaN
1897: DOC    66.9323836494075         3.577910606282754E-007
1897: DON    6.16377178086313         3.247201491180935E-008
1897: DOP   0.222849434662858         4.313611068196024E-009
1897: DOPr   3.858386270617769E-002                       NaN
1897: DONr    1.86200705348168                            NaN
1897: DOCr    20.2761671622387                            NaN
1897: zooC   0.763287877298220        -2.369565062929366E-007
1897: spChl   9.526575502999188E-002   -1.353405387353486E-006
1897: spC   0.637940214131649        -9.062981087773334E-006
1897: spP   4.348255572965681E-003   -6.177406053401722E-008
1897: spFe   1.913820672397084E-005   -2.718894326335037E-010
1897: spCaCO3   2.912888670776460E-002   -4.138233322717762E-007
1897: diatChl   6.305137156249992E-003   -5.179350511652691E-008
1897: diatC   3.348946596417704E-002   -2.750990663194437E-007
1897: diatP   2.288603780628343E-004   -1.879971976568009E-009
1897: diatFe   1.004684278946992E-006   -8.252971989761424E-012
1897: diatSi   4.586396734116307E-003   -3.767492392761152E-008
1897: diazChl   3.781205504814170E-003   -2.752884200074162E-008
1897: diazC   3.042874920955905E-002   -2.215347733658101E-007
1897: diazP   2.074621591844823E-004   -1.510415922660358E-009
1897: diazFe   1.825725552617135E-006   -1.329208640226600E-011
1897: Dust Flux
1897:  2.137987307381511E-011
1897: PAR Column Fraction
1897:   1.00000000000000       0.000000000000000E+000  0.000000000000000E+000
1897:  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
1897: Surface Shortwave
1897:  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
1897:  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
1897: Potential Temperature
1897:   26.6446952819824
1897: Salinity
1897:   38.1813774108887
1897: Pressure
1897:  0.496808084874780
1897: Iron Sediment Flux
1897:  4.950889967323747E-010
1897:------------------------------------------------------------------------
1897:
1897:POP aborting...
1897: Stopping in ecosys_driver:ecosys_driver_set_interior
1897:
1897:------------------------------------------------------------------------
1897:MPT ERROR: Rank 1897(g:1897) is aborting with error code 0.
1897:   Process ID: 18849, Host: r4i2n0, Program: /glade/scratch/mlevy/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.007/bld/cesm.exe
1897:   MPT Version: HPE MPT 2.19  02/23/19 05:30:09
1897:
1897:MPT: --------stack traceback-------

/glade/scratch/mlevy/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.007/run/ has all the logs; not sure if I'll have time to look at this on Tuesday, but definitely will investigate more later in the week.

Michael Levy (Apr 28 2020 at 04:45):

To be clear about the error message, the columns are Tracer Name, Tracer Value at (i, j, k), and Tracer Tendency Returned From MARBL. So the NaN is being returned by MARBL as the computed tendency for 14 of the 32 tracers. The next set of lines are the names of the forcing fields followed by the value passed to MARBL.

Fred Castruccio (Apr 28 2020 at 05:35):

@Michael Levy I think all my HR runs use CICE4 physics. Sorry. I will double check tomorrow.

Keith Lindsay (Apr 28 2020 at 14:41):

Because we're planning to initialize the physics in this hi-res JRA BGC run from the hi-res JRA physics only run that Alper has done, I think we should match his configuration, even if it isn't the latest version of model physics.

Dan Whitt (Apr 28 2020 at 18:02):

I agree w/ Keith; use cice4 phys (I thought there was some reason we might not want to use cice5, which I can't recall).

But, I did use cice5 physics in high-res core nyf runs using modified cesm20b07 tag. This discussion reminds me of a bunch of issues that Dave and I worked through a year+ ago

The ending case and run directory with restart is here:
/glade/p/cgd/oce/people/dwhitt/nsfsubmeso/run_directories/g.e20b07.2000_DATM%NYF_SLND_CICE_POP2_DROF%NYF_SGLC_SWAV.T62_t13.hybrid.027

Michael Levy (Apr 28 2020 at 18:18):

@Dan Whitt Thanks! I'll take a look and see what mods you put in. I just got off a zoom call with @David Bailey and he helped explain the need for ice_import_export.F90 mods (the high-res runs use patch mapping, which is not positive-definite). It'll be interesting to see what other issues you guys talked about

Michael Levy (May 01 2020 at 17:50):

At this point, I'd welcome feedback on the CIME PR: do we want the NCPL updates on master? Is there any reason to hold off and see if additional CIME changes are made, or should we get this merged ASAP and submit future PRs down the line as needed? I'm leaning towards getting it merged quickly, and then relying on cesm2_2_beta05 as the basis for our experiment tag since that will also include our CICE mods.

I haven't had a chance to really dig into why MARBL is returning those NaNs. I'll try to find some time for it next week, but would appreciate any off-the-cuff theories that would be worth investigating (read: I'm not really sure where to start digging)

Keith Lindsay (May 01 2020 at 21:05):

Regarding the NaNs, I've cloned Mike's case and am trying out some hypotheses.
Based on the tracers that do and don't have NaN tendencies, I was guessing that the NaN generation is in the sinking particulate code.
The variables PO[CNP]remin_refractcontrol how much remineralized POM goes into refractory DOM.
I tried a run with those set to zero (via SourceMods) and the refractory DOM tracer tendencies are no longer NaN.
I think this confirms my guess. I'm now reviewing the sinking particulate code to try to figure out what might be going wrong.

Michael Levy (May 01 2020 at 21:10):

Thanks @Keith Lindsay -- if it would help to talk any of this out, I'm available to be a sounding board until ~5:00 today.

Keith Lindsay (May 02 2020 at 00:24):

--- /glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/cime/../components/pop/source/ecosys_driver.F90 2020-05-01 16:01:16.168534401 -0600
+++ SourceMods/src.pop/ecosys_driver.F90    2020-05-01 16:18:36.090587538 -0600
@@ -710,7 +710,7 @@ subroutine ecosys_driver_set_interior(

              marbl_instances(bid)%domain%kmt = KMT(i, c, bid)
              if (partial_bottom_cells) then
-                marbl_instances(bid)%domain%delta_z(:) = DZT(i, c, 1:km, bid)
+                marbl_instances(bid)%domain%delta_z(1:km) = DZT(i, c, 1:km, bid)
              end if

              ! --- set forcing fields ---
@@ -790,6 +790,9 @@ subroutine ecosys_driver_set_interior(
                          end if
                       end associate
                    end do
+                   if (partial_bottom_cells) then
+                      write(stdout, *) 'delta_z = ', marbl_instances(bid)%domain%delta_z(k)
+                   end if
                    call exit_POP(sigAbort, 'Stopping in ' // subname)
                 end if
              end do

1195: ecosys_driver:ecosys_driver_set_interior: NaN in dtracer_module, (i,j,k)=(
1195:        1611 ,         695 ,          46 )
1195: (lon,lat)=(   51.0500000000000      ,  -43.7616814221158      )
1195: PO4    2.04789492842180         1.233632313668388E-011
1195: NO3    29.1541177738795         7.891054589657643E-010
1195: SiO3    80.6860058275754         5.874654000518209E-009
1195: NH4   1.136311860910701E-003   -5.897866983062583E-010
1195: Fe   7.018172115361605E-004   -1.206971895609777E-012
1195: Lig   1.318561169877044E-003   -3.305152602581477E-014
1195: O2    206.443621307117        -3.318239747001380E-009
1195: DIC    2288.10693532270         1.540772284427621E-009
1195: DIC_ALT_CO2    2288.10693532270         1.540772284427621E-009
1195: ALK    2410.24191924178        -1.247831255100442E-009
1195: ALK_ALT_CO2    2410.24191924178        -1.247831255100442E-009
1195: DOC   3.061239864193014E-013   -1.617854655099471E-021
1195: DON   1.529597999753868E-014   -8.818769889268647E-023
1195: DOP   2.035138542409879E-015   -1.434084885287981E-023
1195: DOPr   2.105337715235184E-002   -1.191824074063360E-013
1195: DONr    1.17315908596378        -3.848418582486065E-012
1195: DOCr    23.5750940541949        -4.586495690265037E-011
1195: zooC  -2.145480931261657E-049    0.000000000000000E+000
1195: spChl  -3.612966596092754E-051    0.000000000000000E+000
1195: spC  -1.276973321602058E-050    0.000000000000000E+000
1195: spP  -1.107089422179871E-052    0.000000000000000E+000
1195: spFe  -5.156775088672139E-055    0.000000000000000E+000
1195: spCaCO3  -7.065586469539938E-052    0.000000000000000E+000
1195: diatChl  -5.163524977252080E-057    0.000000000000000E+000
1195: diatC  -1.550109443411441E-056    0.000000000000000E+000
1195: diatP  -1.335046195172915E-058    0.000000000000000E+000
1195: diatFe  -2.086869697838774E-061    0.000000000000000E+000
1195: diatSi  -1.238294531751651E-054    0.000000000000000E+000
1195: diazChl   8.317163243079574E-296                       NaN
1195: diazC   7.800604350267029E-158   -2.946006227443461E-164
1195: diazP   4.136366798894355E-156    0.000000000000000E+000
1195: diazFe   4.399229936014720E-178    0.000000000000000E+000
1195: Dust Flux
1195:  3.153111848031827E-012
1195: PAR Column Fraction
1195:   1.00000000000000       0.000000000000000E+000  0.000000000000000E+000
1195:  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
1195: Surface Shortwave
1195:   52.6980795125229       0.000000000000000E+000  0.000000000000000E+000
1195:  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
1195: Potential Temperature
1195:   1.86141054800521
1195: Salinity
1195:   34.7509233308407
1195: Pressure
1195:   199.228289633652
1195: Iron Sediment Flux
1195:  0.000000000000000E+000
1195: delta_z =    21185.9570000000
1195:------------------------------------------------------------------------
1195:
1195:POP aborting...
1195: Stopping in ecosys_driver:ecosys_driver_set_interior
1195:
1195:------------------------------------------------------------------------

Given how small diazChl is and that the tendency is only NaN for diazChl, this smells like a problem previously encountered, GH #50 where a denominator in an expression underflowed to zero, causing division by zero. I'll test out an idea for a fix.

Keith Lindsay (May 02 2020 at 01:27):

My test run completes a day of execution with the following additional patch to MARBL:

--- /glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/cime/../components/pop/externals/MARBL/src/marbl_interior_tendency_mod.F90  2020-04-17 16:10:35.006494595 -0600
+++ SourceMods/src.pop/marbl_interior_tendency_mod.F90  2020-05-01 18:31:38.592387537 -0600
@@ -1330,6 +1330,7 @@ subroutine compute_autotroph_photosynthe
     real(r8) :: PCphoto_subcol   ! PCphoto for a sub-column
     real(r8) :: pChl_subcol      ! Chl synth. regulation term (mg Chl/mmol N)
     real(r8) :: photoacc_subcol  ! photoacc for a sub-column
+    real(r8) :: work             ! intermediate term common to multiple expressions
     !-----------------------------------------------------------------------

     associate(                                                &
@@ -1358,15 +1359,14 @@ subroutine compute_autotroph_photosynthe
             photoacc(auto_ind,k)  = c0

             do subcol_ind = 1, PAR_nsubcols
-              if (PAR_avg(k,subcol_ind) > c0) then
-                light_lim_subcol = (c1 - exp((-c1 * alphaPI(auto_ind) * thetaC(auto_ind,k) * PAR_avg(k,subcol_ind)) &
-                                       / (PCmax + epsTinv)))
+              work = alphaPI(auto_ind) * thetaC(auto_ind,k) * PAR_avg(k,subcol_ind)
+              if (work > c0) then
+                light_lim_subcol = (c1 - exp(-work / (PCmax + epsTinv)))

                 PCphoto_subcol = PCmax * light_lim_subcol

                 ! GD 98 Chl. synth. term
-                pChl_subcol = autotroph_settings(auto_ind)%thetaN_max * PCphoto_subcol &
-                            / (alphaPI(auto_ind) * thetaC(auto_ind,k) * PAR_avg(k,subcol_ind))
+                pChl_subcol = autotroph_settings(auto_ind)%thetaN_max * PCphoto_subcol / work
                 photoacc_subcol = (pChl_subcol * PCphoto_subcol * Q / thetaC(auto_ind,k)) &
                                 * autotroph_local%Chl(auto_ind,k)

Michael Levy (May 02 2020 at 16:17):

Thanks @Keith Lindsay ! One clarification -- it looks like your first diff (ecosys_driver.F90) was compared against a base in my directory that was already partially repaired. The original line was marbl_instances(bid)%domain%delta_z(:) = DZT(i, c, :, bid), not marbl_instances(bid)%domain%delta_z(:) = DZT(i, c, 1:km, bid) (I think I was starting to fix things in my base directory while we chatted, and then I forgot to revert the change when we decided you'd test the mods... so your diff picked up my patch as well). For everyone else following the thread, it turns out that the k index of DZT runs from 0:km+1, and I had assumed it was 1:km when updating delta_z in the driver. Anyway, that fix is on the POP branch and I'll look at the MARBL fix on Monday.

Keith Lindsay (May 02 2020 at 16:26):

Good catch @Michael Levy . I think I did the copy to my SourceMods prior to your edit and subsequently didn't notice that the diff was partial.

Michael Levy (May 06 2020 at 22:25):

I've verified that @Keith Lindsay's MARBL mods are sufficient to pass the test -- I'm just running aux_pop_MARBL to ensure they don't change answers. Currently generating baselines off development; I hope to launch the comparison tests later today and have results / a new MARBL PR tomorrow.

Michael Levy (Jun 03 2020 at 14:54):

Note that this is with POP output turned off and without coccolithophores; I'm going to add coccolithophores for my next test.

Michael Levy (Jun 03 2020 at 14:56):

Also, the above numbers come from single runs rather than averaging many runs... so discrepancies due to which nodes we happened to be given on the machine may be distorting the numbers somewhat

Stream: 0.1° JRA BGC Run

Topic: First Test Runs

Michael Levy (Apr 24 2020 at 21:06):