Per the discussion in the model configuration thread, I've got a 1-day test run in the queue at cheyenne. It differs from our eventual production run in the following ways (probably not an all-inclusive list):
riv_flux_shr_stream_file = 'unknown'
-- the file is in the process of being created but isn't ready yet

I'm running for a single day just to make sure everything actually works -- future timing runs will be O(1 month) once I have a starting point for estimating wallclock needs.
My first test died in CICE:
162: istep1, my_task, iblk = 17665 162 1
162: category n = 1
162: Global block: 430
162: Global i and j: 1047 198
162: Lat, Lon: -70.1472839712437 -5.35000000000000
162: aice: 3.442529845610359E-002
162: n: 1 aicen: 2.803949195110218E-003
162: ice: Vertical thermo error
162: ERROR: ice: Vertical thermo error
162:Image              PC                Routine            Line        Source
162:cesm.exe           000000000167EA9A  Unknown            Unknown     Unknown
162:cesm.exe           0000000000D768F0  shr_abort_mod_mp_  114         shr_abort_mod.F90
162:cesm.exe           00000000005058C4  ice_exit_mp_abort  46          ice_exit.F90
162:cesm.exe           000000000070427E  ice_step_mod_mp_s  578         ice_step_mod.F90
162:cesm.exe           00000000005E1748  cice_runmod_mp_ci  172         CICE_RunMod.F90
162:cesm.exe           00000000004F7559  ice_comp_mct_mp_i  584         ice_comp_mct.F90
162:cesm.exe           0000000000428324  component_mod_mp_  737         component_mod.F90
162:cesm.exe           0000000000409D5C  cime_comp_mod_mp_  2603        cime_comp_mod.F90
162:cesm.exe           0000000000427F5C  MAIN__             133         cime_driver.F90
162:cesm.exe           0000000000407D22  Unknown            Unknown     Unknown
162:libc.so.6          00002AD9A84DA6E5  __libc_start_main  Unknown     Unknown
162:cesm.exe           0000000000407C29  Unknown            Unknown     Unknown
Case root: /glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/tests/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.20200424_143439_sexhyo
Rundir: /glade/scratch/mlevy/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.20200424_143439_sexhyo/run
I'm currently waiting on two SMS_Ld1.TL319_t13.GIAF_JRA_HR.cheyenne_intel.pop-performance_eval tests to run, one in the same sandbox and one in a CESM 2.1 sandbox. It doesn't look like there were any CICE mods when JRA was updated in 2.1, but if 2.1 runs successfully and the 2.2 sandbox doesn't, then maybe it'll help us track down something in CIME or POP that didn't get merged in correctly.
@Keith Lindsay I was under the impression that nobody has run GIAF_JRA_HR in the latest 2.1 -- is that accurate? So it's possible that we just need to talk to @David Bailey about how to configure CICE correctly?
The usual fix for ERROR: ice: Vertical thermo error is the following patch to $SRCROOT/components/cice/src/source/ice_therm_vertical.F90:
--- /glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/components/cice/src/source/ice_therm_vertical.F90	2020-04-17 16:10:05.319640578 -0600
+++ ./ice_therm_vertical.F90	2020-04-25 15:20:33.546097802 -0600
@@ -2387,7 +2387,7 @@ subroutine conservation_check_vthermo(nx
                     - fsnow(i,j)*Lfresh - fadvocn(i,j)) * dt
          ferr = abs(efinal(ij)-einit(ij)-einp) / dt
-         if (ferr > ferrmax) then
+         if (ferr > 1.1_dbl_kind*ferrmax) then
            l_stop = .true.
            istop = i
            jstop = j
This just loosens the tolerance on the energy conservation.
It doesn't change answers, except for allowing runs to proceed.
It looks like this was added to CESM_CICE5 master on Dec 23, 2019, and you're using a tag from Nov 6, 2019.
I'm surprised to see OCN_NCPL=1 in your test. That doesn't seem right to me.
Isn't that the default when using leapfrog rather than Robert? (Though I suppose this means I won't actually be running POP in my test, since the ocean waits a coupling interval before starting? That's a bummer.)
So I'll test this CICE change with an SMS_Ld2.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval test so that POP gets a chance to take some time steps (it initialized fine in the SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval tests that caused CICE to crash). The test is building and submitting now; I'll check in on it tomorrow or Monday and let everyone know how it goes.
When you run with leapfrog, you need to use at least 3 ocean timesteps (2 full + 1 half) per coupling interval. If you have OCN_NCPL=24, this means a timestep of 24 minutes or less. One reason for not doing that with the gx1v[67] grids is that it's a much shorter timestep than the 1 hr target timestep for that resolution. But for the tx0.1v[23] grid the target timestep is a few minutes, so it's not problematic to use a higher OCN_NCPL.
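Spelling out that arithmetic (2 full + 1 half = 2.5 timestep lengths per coupling interval):

dt_max = (86400 s / OCN_NCPL) / 2.5

OCN_NCPL = 24  ->  dt_max = 3600 s / 2.5 = 1440 s = 24 min
OCN_NCPL = 48  ->  dt_max = 1800 s / 2.5 =  720 s = 12 min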
Alper is using ATM_NCPL=144, OCN_NCPL=48 in /glade/scratch/altuntas/temp/g.e20.G.TL319_t13.control.001_hfreq. I suggest asking him to point us to a trusted hi-res JRA case. There are other settings that we'll want to vet also, but I'm focusing on the NCPL settings initially because these seem likely to impact timing tests. I think it's a great step forward that your test run completed ocean model and MARBL initialization.
Thanks for pointing me to Alper's case root - I'll send him an email and see if he can point me to documentation of any changes that haven't made it into the compset definition yet
Putting the 1.1 multiple in front of ferrmax just reduces the number of tasks reporting the error; it looks like the ice.log is getting to the exact same spot, but the first error in cesm.log is now:
664: istep1, my_task, iblk = 17665 664 1
664: category n = 1
664: Global block: 1015
664: Global i and j: 539 401
664: Lat, Lon: -61.3373858360807 -56.1500000000000
664: aice: 1.377508213909708E-003
664: n: 1 aicen: 4.334067408077150E-004
664: ice: Vertical thermo error
664: ERROR: ice: Vertical thermo error
664:Image              PC                Routine            Line        Source
664:cesm.exe           000000000167EA9A  Unknown            Unknown     Unknown
664:cesm.exe           0000000000D768F0  shr_abort_mod_mp_  114         shr_abort_mod.F90
664:cesm.exe           00000000005058C4  ice_exit_mp_abort  46          ice_exit.F90
664:cesm.exe           000000000070427E  ice_step_mod_mp_s  578         ice_step_mod.F90
664:cesm.exe           00000000005E1748  cice_runmod_mp_ci  172         CICE_RunMod.F90
664:cesm.exe           00000000004F7559  ice_comp_mct_mp_i  584         ice_comp_mct.F90
664:cesm.exe           0000000000428324  component_mod_mp_  737         component_mod.F90
664:cesm.exe           0000000000409D5C  cime_comp_mod_mp_  2603        cime_comp_mod.F90
664:cesm.exe           0000000000427F5C  MAIN__             133         cime_driver.F90
664:cesm.exe           0000000000407D22  Unknown            Unknown     Unknown
664:libc.so.6          00002B2EE17746E5  __libc_start_main  Unknown     Unknown
664:cesm.exe           0000000000407C29  Unknown            Unknown     Unknown
and the ice_step_mod traceback appears fewer times in cesm.log. However, I added CIME mods to change the coupling frequency to what Alper is using. That test (/glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/tests/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.003) is in the queue; if it runs before I go to bed I'll provide one more update, otherwise I'll check on it on Monday.
Alper's run includes a few other modifications that I have NOT added to our run. These are just the ones that I noticed:
1. No ideal age tracer (probably negligible for us, since we're adding the 32 BGC tracers)
2. No CVMix (but turning tidal mixing on)
3. h_upper = 20.0 in user_nl_pop
4. ndtd = 2 in user_nl_cice

Is there any chance that changing ndtd from 3 (the default value for the 0.1 degree grids) to 2 would fix this Vertical thermo error issue?
Looks like my .003 test finished, but nothing beyond initialization in the ocean log:

$ tail -n 6 ocn.log.1953646.chadmin1.ib0.cheyenne.ucar.edu.200425-220050
========================================================================
 End of initialization
========================================================================
So I definitely need to get past the CICE error to figure out if we'll run into POP errors
This is more serious than a small energy conservation error. What is ICE_NCPL/ATM_NCPL? What is ndtd (xndt_dyn)? What version of CESM are you running here? How long into the run before this blows up? How are you initializing POP/CICE?
@David Bailey This is cesm2_2_beta04; out of the box ICE_NCPL = ATM_NCPL = 24 and OCN_NCPL = 1, but the error also occurs when I change to ICE_NCPL = ATM_NCPL = 144 and OCN_NCPL = 24. I think ndtd = 3; I was toying with the idea of setting it to 2 instead to match Alper's run but haven't done so yet. I'll double-check initialization, but I think it's a startup run; it's the GIAF_JRA_HR compset.
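(For reference, the one-off way to make that coupling-frequency change in an existing case directory would be something like the xmlchange calls below -- a sketch; in my tests the new values actually came in via the CIME mods mentioned earlier.)

./xmlchange --id ATM_NCPL --val 144
./xmlchange --id ICE_NCPL --val 144
./xmlchange --id OCN_NCPL --val 24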
@David Bailey oh, I noticed two other differences with Alper's older high-res run: he has ktherm=1 and r_snw=1.00, while CICE currently defaults to ktherm=2 and r_snw=1.25 for the high-res run. So I'm testing with his values of those three variables (ndtd, ktherm, r_snw).
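Concretely, that test just drops his values into user_nl_cice -- a sketch, with the values taken from the discussion above:

ndtd = 2
ktherm = 1
r_snw = 1.00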
Oh! He was using the CICE4 physics, but we are trying to use the CICE5 physics for the high res now. @Fred Castruccio has some runs with CICE5 physics on. I thought Alper had done some with the CICE5 physics too? What is the CICE initial file? It looks like it is crashing right away.
ice_ic = "/glade/p/cesmdata/cseg/inputdata/ice/cice/g.e11.G.T62_t12.002.cice.r.0016-01-01-00000.nc"
Off a conference call, so longer answer: Alper does have a run out of the CESM 2.1 sandbox. It's possible that something has gone wonky in the merge from CESM 2.1 -> 2.2 (the CIME merge had some issues that were resolved in a later PR, though something else could have slipped through the cracks), or it could be a difference in CICE in the two branches. Or, as you pointed out, it could be CICE4 vs CICE5. He sent me the script he used to set up the run, which includes
./xmlchange --id CICE_CONFIG_OPTS --val "-phys cice4 -trage 0"
and also some source mods for CICE from /glade/p_old/work/altuntas/cesm.sourcemods/cesm2_0_alpha08b.jra/src.cice/:

$ ls SourceMods/src.cice/
ice_import_export.F90  namelist_definition_cice.xml  README
So if @Fred Castruccio has CICE5 working more out-of-the-box, that would be great :)
/glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/tests/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.006 was run with CICE4 and still aborts on what appears to be the first CICE time step:
546: istep1, my_task, iblk = 17665 546 1
546: category n = 1
546: Global block: 897
546: Global i and j: 3462 321
546: Lat, Lon: -64.9490236870987 -123.850000000000
546: aice: 1.053948729398668E-003
546: n: 1 aicen: 3.055358416585096E-004
546: ice: Vertical thermo error
546: ERROR: ice: Vertical thermo error
546:Image              PC                Routine            Line        Source
546:cesm.exe           000000000167377A  Unknown            Unknown     Unknown
546:cesm.exe           0000000000D6B5D0  shr_abort_mod_mp_  114         shr_abort_mod.F90
546:cesm.exe           00000000005058C4  ice_exit_mp_abort  46          ice_exit.F90
546:cesm.exe           00000000006FB853  ice_step_mod_mp_s  578         ice_step_mod.F90
546:cesm.exe           00000000005DC028  cice_runmod_mp_ci  172         CICE_RunMod.F90
546:cesm.exe           00000000004F7559  ice_comp_mct_mp_i  584         ice_comp_mct.F90
546:cesm.exe           0000000000428324  component_mod_mp_  737         component_mod.F90
546:cesm.exe           0000000000409D5C  cime_comp_mod_mp_  2603        cime_comp_mod.F90
546:cesm.exe           0000000000427F5C  MAIN__             133         cime_driver.F90
546:cesm.exe           0000000000407D22  Unknown            Unknown     Unknown
546:libc.so.6          00002AD921E736E5  __libc_start_main  Unknown     Unknown
546:cesm.exe           0000000000407C29  Unknown            Unknown     Unknown
Alper's source mods in ice_import_export.F90 prevent several fields from being negative:

140,141c140,141
<             Qa   (i,j,iblk) = aflds(i,j, 6,iblk)
<             rhoa (i,j,iblk) = aflds(i,j, 7,iblk)
---
>             Qa   (i,j,iblk) = max(aflds(i,j, 6,iblk),c0)
>             rhoa (i,j,iblk) = max(aflds(i,j, 7,iblk),c0)
143,146c143,146
<             swvdr(i,j,iblk) = aflds(i,j, 9,iblk)
<             swidr(i,j,iblk) = aflds(i,j,10,iblk)
<             swvdf(i,j,iblk) = aflds(i,j,11,iblk)
<             swidf(i,j,iblk) = aflds(i,j,12,iblk)
---
>             swvdr(i,j,iblk) = max(aflds(i,j, 9,iblk),c0)
>             swidr(i,j,iblk) = max(aflds(i,j,10,iblk),c0)
>             swvdf(i,j,iblk) = max(aflds(i,j,11,iblk),c0)
>             swidf(i,j,iblk) = max(aflds(i,j,12,iblk),c0)
148,149c148,149
<             frain(i,j,iblk) = aflds(i,j,14,iblk)
<             fsnow(i,j,iblk) = aflds(i,j,15,iblk)
---
>             frain(i,j,iblk) = max(aflds(i,j,14,iblk),c0)
>             fsnow(i,j,iblk) = max(aflds(i,j,15,iblk),c0)

(I snipped out some OMP statements, since NTHRDS_ICE=1, and also some diagnostic write() statements.)
Adding the above to CICE's ice_import_export.F90 kept CICE running past the first time step (@David Bailey does this hint at a problem in either the forcing data or the maps, or does it seem like a reasonable work-around?). Probably not a surprise, but getting past the first time step just let us find a runtime error in POP:
1897: ecosys_driver:ecosys_driver_set_interior: NaN in dtracer_module, (i,j,k)=(
1897:  1497 , 1339 , 1 )
1897: (lon,lat)=( 39.6500000000000 , 15.4607237580609 )
1897: PO4          0.211717655882239       NaN
1897: NO3          1.27333082631230        NaN
1897: SiO3         2.19131705164909        NaN
1897: NH4          6.350598431893197E-002  NaN
1897: Fe           2.437298209344740E-003  NaN
1897: Lig          5.812391901820265E-004  NaN
1897: O2           203.535365092392        NaN
1897: DIC          2105.65655080915        NaN
1897: DIC_ALT_CO2  2105.65655080915        NaN
1897: ALK          2391.51854738421        NaN
1897: ALK_ALT_CO2  2391.51854738421        NaN
1897: DOC          66.9323836494075        3.577910606282754E-007
1897: DON          6.16377178086313        3.247201491180935E-008
1897: DOP          0.222849434662858       4.313611068196024E-009
1897: DOPr         3.858386270617769E-002  NaN
1897: DONr         1.86200705348168        NaN
1897: DOCr         20.2761671622387        NaN
1897: zooC         0.763287877298220       -2.369565062929366E-007
1897: spChl        9.526575502999188E-002  -1.353405387353486E-006
1897: spC          0.637940214131649       -9.062981087773334E-006
1897: spP          4.348255572965681E-003  -6.177406053401722E-008
1897: spFe         1.913820672397084E-005  -2.718894326335037E-010
1897: spCaCO3      2.912888670776460E-002  -4.138233322717762E-007
1897: diatChl      6.305137156249992E-003  -5.179350511652691E-008
1897: diatC        3.348946596417704E-002  -2.750990663194437E-007
1897: diatP        2.288603780628343E-004  -1.879971976568009E-009
1897: diatFe       1.004684278946992E-006  -8.252971989761424E-012
1897: diatSi       4.586396734116307E-003  -3.767492392761152E-008
1897: diazChl      3.781205504814170E-003  -2.752884200074162E-008
1897: diazC        3.042874920955905E-002  -2.215347733658101E-007
1897: diazP        2.074621591844823E-004  -1.510415922660358E-009
1897: diazFe       1.825725552617135E-006  -1.329208640226600E-011
1897: Dust Flux
1897:  2.137987307381511E-011
1897: PAR Column Fraction
1897:  1.00000000000000       0.000000000000000E+000 0.000000000000000E+000
1897:  0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
1897: Surface Shortwave
1897:  0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
1897:  0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
1897: Potential Temperature
1897:  26.6446952819824
1897: Salinity
1897:  38.1813774108887
1897: Pressure
1897:  0.496808084874780
1897: Iron Sediment Flux
1897:  4.950889967323747E-010
1897:------------------------------------------------------------------------
1897:
1897:POP aborting...
1897: Stopping in ecosys_driver:ecosys_driver_set_interior
1897:
1897:------------------------------------------------------------------------
1897:MPT ERROR: Rank 1897(g:1897) is aborting with error code 0.
1897:  Process ID: 18849, Host: r4i2n0, Program: /glade/scratch/mlevy/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.007/bld/cesm.exe
1897:  MPT Version: HPE MPT 2.19 02/23/19 05:30:09
1897:
1897:MPT: --------stack traceback-------
/glade/scratch/mlevy/SMS_Ld1.TL319_t13.GIAFECO_JRA_HR.cheyenne_intel.pop-performance_eval.007/run/ has all the logs; not sure if I'll have time to look at this on Tuesday, but I'll definitely investigate more later in the week.
To be clear about the error message, the columns are Tracer Name, Tracer Value at (i, j, k), and Tracer Tendency Returned From MARBL. So the NaN is being returned by MARBL as the computed tendency for 14 of the 32 tracers. The next set of lines are the names of the forcing fields followed by the values passed to MARBL.
@Michael Levy I think all my HR runs use CICE4 physics. Sorry. I will double check tomorrow.
Because we're planning to initialize the physics in this hi-res JRA BGC run from the hi-res JRA physics-only run that Alper has done, I think we should match his configuration, even if it isn't the latest version of the model physics.
I agree w/ Keith; use cice4 phys (I thought there was some reason we might not want to use cice5, which I can't recall).
But I did use cice5 physics in high-res CORE NYF runs using a modified cesm20b07 tag. This discussion reminds me of a bunch of issues that Dave and I worked through a year+ ago. The ending case and run directory with restart is here:
/glade/p/cgd/oce/people/dwhitt/nsfsubmeso/run_directories/g.e20b07.2000_DATM%NYF_SLND_CICE_POP2_DROF%NYF_SGLC_SWAV.T62_t13.hybrid.027
@Dan Whitt Thanks! I'll take a look and see what mods you put in. I just got off a Zoom call with @David Bailey and he helped explain the need for the ice_import_export.F90 mods (the high-res runs use patch mapping, which is not positive-definite). It'll be interesting to see what other issues you guys talked about.
I'm done with the first wave of testing, and here's where things stand:
1. (i) ATM_NCPL and OCN_NCPL were too small, and (ii) when I updated the maint-5.6 -> master merge I had a typo in the JRA -> 0.1 degree runoff map file names. Those are addressed in https://github.com/ESMCI/cime/pull/3515, but that PR is still in "draft" status until I can do more thorough testing and verify that we want the updated NCPL values in CIME.
2. NaNs for some tracer tendencies on the first time step. Details in https://zulip2.cloud.ucar.edu/#narrow/stream/20-0.2E1.C2.B0-JRA.20BGC.20Run/topic/First.20Test.20Runs/near/7385
3. CICE: the fixes have been tagged (thanks, Dave!), so using https://github.com/ESCOMP/CESM_CICE5/releases/tag/cice5_20200430 will give us CICE4 physics and the necessary patches in ice_import_export.F90 out of the box. Alper's run has a few small mods that still need to be included (a sketch follows below): (i) turning off ideal age by appending -trage 0 to CICE_CONFIG_OPTS, and (ii) three namelist changes: ndtd=2, r_snw=1.00, and f_blkmask = .true.
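Roughly, those remaining mods would look like this in the case directory (a sketch -- I'm assuming xmlchange's --append flag is the tidiest way to add -trage 0 to the existing CICE_CONFIG_OPTS; Alper's script instead sets the full string with --id/--val):

./xmlchange --append --id CICE_CONFIG_OPTS --val "-trage 0"

and in user_nl_cice:

ndtd = 2
r_snw = 1.00
f_blkmask = .true.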
At this point, I'd welcome feedback on the CIME PR: do we want the NCPL updates on master? Is there any reason to hold off and see if additional CIME changes are made, or should we get this merged ASAP and submit future PRs down the line as needed? I'm leaning towards getting it merged quickly and then relying on cesm2_2_beta05 as the basis for our experiment tag, since that will also include our CICE mods.
I haven't had a chance to really dig into why MARBL is returning those NaNs. I'll try to find some time for it next week, but would appreciate any off-the-cuff theories that would be worth investigating (read: I'm not really sure where to start digging)
Regarding the NaNs, I've cloned Mike's case and am trying out some hypotheses.
Based on the tracers that do and don't have NaN tendencies, I was guessing that the NaN generation is in the sinking particulate code.
The variables PO[CNP]remin_refract control how much remineralized POM goes into refractory DOM.
I tried a run with those set to zero (via SourceMods) and the refractory DOM tracer tendencies are no longer NaN.
I think this confirms my guess. I'm now reviewing the sinking particulate code to try to figure out what might be going wrong.
Thanks @Keith Lindsay -- if it would help to talk any of this out, I'm available to be a sounding board until ~5:00 today.
The following patch gets my test run past these NaNs:
--- /glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/cime/../components/pop/source/ecosys_driver.F90	2020-05-01 16:01:16.168534401 -0600
+++ SourceMods/src.pop/ecosys_driver.F90	2020-05-01 16:18:36.090587538 -0600
@@ -710,7 +710,7 @@ subroutine ecosys_driver_set_interior(
          marbl_instances(bid)%domain%kmt = KMT(i, c, bid)
          if (partial_bottom_cells) then
-            marbl_instances(bid)%domain%delta_z(:) = DZT(i, c, 1:km, bid)
+            marbl_instances(bid)%domain%delta_z(1:km) = DZT(i, c, 1:km, bid)
          end if

          ! --- set forcing fields ---
@@ -790,6 +790,9 @@ subroutine ecosys_driver_set_interior(
                end if
             end associate
          end do
+         if (partial_bottom_cells) then
+            write(stdout, *) 'delta_z = ', marbl_instances(bid)%domain%delta_z(k)
+         end if
          call exit_POP(sigAbort, 'Stopping in ' // subname)
       end if
    end do
However, the test run still aborts, with another generated NaN:
1195: ecosys_driver:ecosys_driver_set_interior: NaN in dtracer_module, (i,j,k)=(
1195:  1611 , 695 , 46 )
1195: (lon,lat)=( 51.0500000000000 , -43.7616814221158 )
1195: PO4          2.04789492842180         1.233632313668388E-011
1195: NO3          29.1541177738795         7.891054589657643E-010
1195: SiO3         80.6860058275754         5.874654000518209E-009
1195: NH4          1.136311860910701E-003  -5.897866983062583E-010
1195: Fe           7.018172115361605E-004  -1.206971895609777E-012
1195: Lig          1.318561169877044E-003  -3.305152602581477E-014
1195: O2           206.443621307117        -3.318239747001380E-009
1195: DIC          2288.10693532270         1.540772284427621E-009
1195: DIC_ALT_CO2  2288.10693532270         1.540772284427621E-009
1195: ALK          2410.24191924178        -1.247831255100442E-009
1195: ALK_ALT_CO2  2410.24191924178        -1.247831255100442E-009
1195: DOC          3.061239864193014E-013  -1.617854655099471E-021
1195: DON          1.529597999753868E-014  -8.818769889268647E-023
1195: DOP          2.035138542409879E-015  -1.434084885287981E-023
1195: DOPr         2.105337715235184E-002  -1.191824074063360E-013
1195: DONr         1.17315908596378        -3.848418582486065E-012
1195: DOCr         23.5750940541949        -4.586495690265037E-011
1195: zooC        -2.145480931261657E-049   0.000000000000000E+000
1195: spChl       -3.612966596092754E-051   0.000000000000000E+000
1195: spC         -1.276973321602058E-050   0.000000000000000E+000
1195: spP         -1.107089422179871E-052   0.000000000000000E+000
1195: spFe        -5.156775088672139E-055   0.000000000000000E+000
1195: spCaCO3     -7.065586469539938E-052   0.000000000000000E+000
1195: diatChl     -5.163524977252080E-057   0.000000000000000E+000
1195: diatC       -1.550109443411441E-056   0.000000000000000E+000
1195: diatP       -1.335046195172915E-058   0.000000000000000E+000
1195: diatFe      -2.086869697838774E-061   0.000000000000000E+000
1195: diatSi      -1.238294531751651E-054   0.000000000000000E+000
1195: diazChl      8.317163243079574E-296   NaN
1195: diazC        7.800604350267029E-158  -2.946006227443461E-164
1195: diazP        4.136366798894355E-156   0.000000000000000E+000
1195: diazFe       4.399229936014720E-178   0.000000000000000E+000
1195: Dust Flux
1195:  3.153111848031827E-012
1195: PAR Column Fraction
1195:  1.00000000000000       0.000000000000000E+000 0.000000000000000E+000
1195:  0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
1195: Surface Shortwave
1195:  52.6980795125229       0.000000000000000E+000 0.000000000000000E+000
1195:  0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
1195: Potential Temperature
1195:  1.86141054800521
1195: Salinity
1195:  34.7509233308407
1195: Pressure
1195:  199.228289633652
1195: Iron Sediment Flux
1195:  0.000000000000000E+000
1195: delta_z = 21185.9570000000
1195:------------------------------------------------------------------------
1195:
1195:POP aborting...
1195: Stopping in ecosys_driver:ecosys_driver_set_interior
1195:
1195:------------------------------------------------------------------------
Given how small diazChl is, and that the tendency is NaN only for diazChl, this smells like a previously encountered problem, GH #50, where a denominator in an expression underflowed to zero, causing division by zero. I'll test out an idea for a fix.
My test run completes a day of execution with the following additional patch to MARBL:
--- /glade/work/mlevy/codes/CESM/cesm2_2_beta04+GECO_JRA_HR/cime/../components/pop/externals/MARBL/src/marbl_interior_tendency_mod.F90	2020-04-17 16:10:35.006494595 -0600
+++ SourceMods/src.pop/marbl_interior_tendency_mod.F90	2020-05-01 18:31:38.592387537 -0600
@@ -1330,6 +1330,7 @@ subroutine compute_autotroph_photosynthe
     real(r8) :: PCphoto_subcol  ! PCphoto for a sub-column
     real(r8) :: pChl_subcol     ! Chl synth. regulation term (mg Chl/mmol N)
     real(r8) :: photoacc_subcol ! photoacc for a sub-column
+    real(r8) :: work            ! intermediate term common to multiple expressions
     !-----------------------------------------------------------------------
     associate( &
@@ -1358,15 +1359,14 @@ subroutine compute_autotroph_photosynthe
         photoacc(auto_ind,k) = c0
         do subcol_ind = 1, PAR_nsubcols
-          if (PAR_avg(k,subcol_ind) > c0) then
-            light_lim_subcol = (c1 - exp((-c1 * alphaPI(auto_ind) * thetaC(auto_ind,k) * PAR_avg(k,subcol_ind)) &
-                 / (PCmax + epsTinv)))
+          work = alphaPI(auto_ind) * thetaC(auto_ind,k) * PAR_avg(k,subcol_ind)
+          if (work > c0) then
+            light_lim_subcol = (c1 - exp(-work / (PCmax + epsTinv)))
             PCphoto_subcol = PCmax * light_lim_subcol
             ! GD 98 Chl. synth. term
-            pChl_subcol = autotroph_settings(auto_ind)%thetaN_max * PCphoto_subcol &
-                 / (alphaPI(auto_ind) * thetaC(auto_ind,k) * PAR_avg(k,subcol_ind))
+            pChl_subcol = autotroph_settings(auto_ind)%thetaN_max * PCphoto_subcol / work
             photoacc_subcol = (pChl_subcol * PCphoto_subcol * Q / thetaC(auto_ind,k)) &
                  * autotroph_local%Chl(auto_ind,k)
Thanks @Keith Lindsay! One clarification -- it looks like your first diff (ecosys_driver.F90) was compared against a base in my directory that was already partially repaired. The original line was marbl_instances(bid)%domain%delta_z(:) = DZT(i, c, :, bid), not marbl_instances(bid)%domain%delta_z(:) = DZT(i, c, 1:km, bid) (I think I was starting to fix things in my base directory while we chatted, and then I forgot to revert the change when we decided you'd test the mods... so your diff picked up my patch as well). For everyone else following the thread, it turns out that the k index of DZT runs from 0:km+1, and I had assumed it was 1:km when updating delta_z in the driver. Anyway, that fix is on the POP branch and I'll look at the MARBL fix on Monday.
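For anyone who wants the indexing mistake spelled out, here's a minimal stand-alone sketch (1-D stand-ins for DZT and delta_z; only the 0:km+1 vs 1:km bounds come from the real code):

program dzt_bounds_sketch
  implicit none
  integer, parameter :: km = 4
  real :: DZT(0:km+1)    ! stand-in for POP's DZT(i,c,0:km+1,bid): k runs 0:km+1
  real :: delta_z(1:km)  ! stand-in for MARBL's domain%delta_z: levels 1:km only
  integer :: k

  DZT = [( real(k), k = 0, km+1 )]  ! fill with the level index so any offset is visible

  ! delta_z(:) = DZT(:)             ! the original bug: shapes don't conform (km vs km+2)
  delta_z(1:km) = DZT(1:km)         ! the fix: copy only levels 1 through km
  print *, delta_z                  ! prints 1.0 2.0 3.0 4.0 -- the levels line up
end program dzt_bounds_sketch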
Good catch @Michael Levy . I think I did the copy to my SourceMods prior to your edit and subsequently didn't notice that the diff was partial.
I've verified that @Keith Lindsay's MARBL mods are sufficient to pass the test -- I'm just running aux_pop_MARBL to ensure they don't change answers. Currently generating baselines off development; I hope to launch the comparison tests later today and have results / a new MARBL PR tomorrow.
Preliminary throughput data: https://docs.google.com/spreadsheets/d/17c-uLIceaKVDoHROTPA1WObqN-QwSDcRJE20K9vDfWA/edit?usp=sharing
Note that this is with POP output turned off and without coccolithophores; I'm going to add coccolithophores for my next test.
Also, the above numbers come from single runs rather than averaging many runs... so discrepancies due to which nodes we happened to be given on the machine may be distorting the numbers somewhat