intake-esm and smyle datasets · ESDS

Deepak Cherian (Apr 14 2021 at 18:03):

It seems like there is a common issue with efficiently reading subsets of CESM-LE / SMYLE / other datasets. Daniel's efficient solution to this problem is being copy-pasted in quite a number of notebooks I've seen.

I think the sustainable way forward is to move that logic into intake_esm (if it is not there already) and to promote intake_esm more.

Daniel Kennedy (Apr 14 2021 at 18:11):

Will Wieder (Apr 14 2021 at 18:15):

This works for me.
Can we include some good documentation for how users should interface with the function in intake_esm?

Matt Long (Apr 14 2021 at 18:29):

      {
        "type": "join_new",
        "attribute_name": "start_date",
        "options": { "coords": "minimal", "compat": "override" }
      }

We worked through aspects of this to support CMIP DP experiments, but I am not sure whether any of that work was retained. @Anderson Banihirwe, do you remember?
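For context, an aggregation entry like the one above normally sits in the `aggregations` list under `aggregation_control` in an intake-esm catalog JSON. Here is a minimal sketch of that enclosing block, built as a Python dict. The column names `variable`, `experiment` and `member_id`, and the two extra aggregation entries, are assumptions for illustration and are not taken from an actual SMYLE catalog.

```python
import json

# The join_new entry from the message above: datasets that differ only in the
# catalog column "start_date" are concatenated along a new dimension.
start_date_agg = {
    "type": "join_new",
    "attribute_name": "start_date",
    "options": {"coords": "minimal", "compat": "override"},
}

# Hypothetical enclosing block; every column name here is an assumption.
aggregation_control = {
    "variable_column_name": "variable",
    "groupby_attrs": ["experiment"],
    "aggregations": [
        {"type": "union", "attribute_name": "variable"},
        {
            "type": "join_new",
            "attribute_name": "member_id",
            "options": {"coords": "minimal", "compat": "override"},
        },
        start_date_agg,
    ],
}

print(json.dumps(aggregation_control, indent=2))
```

Each `join_new` entry adds one new dimension, so this layout would give one dimension per member and one per start date, which is the shape the thread is after.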

Deepak Cherian (Apr 22 2021 at 22:55):

I posted this on the repo but it's private so I'm posting here too since it seems to be broadly useful.

The following version seems to work and is a single open_mfdataset call. It should be more feasible to write it as an intake-esm thing.
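The single call relies on the file list being nested: an outer list with one entry per start year, each holding an inner list with one file per member. The sketch below shows that shape only. The helper name, the file-name template and its fields are made up for illustration; this is not the `nested_file_list_by_year` function used in the code that follows.

```python
from typing import Iterable, List, Tuple


def toy_nested_file_list(
    template: str, years: Iterable[int], members: Iterable[int], field: str
) -> Tuple[List[List[str]], List[int]]:
    """Return (nested, years): nested[i][j] is the file for year i, member j."""
    years = list(years)
    members = list(members)
    nested = [
        [template.format(year=y, member=m, field=field) for m in members]
        for y in years
    ]
    return nested, years


# Hypothetical template; real SMYLE paths look different.
files, yrs = toy_nested_file_list(
    "run_{year}_{member:03d}_{field}.nc", range(1970, 1973), range(1, 3), "TEMP"
)
for row in files:
    print(row)
```

With this ordering the outer level corresponds to the start year and the inner level to the member, which is what determines the order of the `concat_dim` entries.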

import numpy as np
import xarray as xr


def preprocess_smyle(ds0):
    """Applied to each file individually, before the files are combined."""
    d0 = ds0.isel(z_t=0).isel(time=slice(0, 24))

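    # quick fix to adjust time vector for monthly data
    nmonths = len(d0.time)
    yr0 = d0["time.year"][0].values
    # note: recent xarray releases deprecate cftime_range in favour of
    # xr.date_range(..., use_cftime=True)
    d0["time"] = xr.cftime_range(str(yr0), periods=nmonths, freq="MS")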

    # DC: Can't assign M yet since we have a single file = 1 member
    # d0 = d0.assign_coords(M=("M", np.arange(sizes["M"])))
    d0 = d0.assign_coords(L=("time", np.arange(d0.sizes["time"])+1))
    d0 = d0.swap_dims({"time": "L"})
    d0 = d0.reset_coords(["time"])

    # DC: explicitly add the dimension Y so xarray knows to concatenate time variables also
    # if you know what Y should be, you could do `.expand_dims(Y=[year])`
    #    - d0.encoding["source"] is the file name, which might be useful for setting some
    #      coordinate values
    # you could also do this with TEMP to be really explicit about what you want.
    # Because you pass data_vars=["TEMP"], xarray will do expand_dims on TEMP for you.
    d0["time"] = d0.time.expand_dims("Y")
    d0["time_bound"] = d0.time_bound.expand_dims("Y")

    # DC: subsetting as in the original code is not necessary,
    # but it does speed things up.
    return d0[["time", "time_bound", "TEMP", "TAREA", "UAREA"]]



file_list, yrs = nested_file_list_by_year(filetemplate, ens, field, firstyear, lastyear, startmonth)

ds0 = xr.open_mfdataset(
    file_list,
    combine="nested",
    # concat_dim depends on how file_list is ordered:
    # the innermost lists of datasets are combined along "M",
    # then the outer list is combined along "Y"
    concat_dim=["Y", "M"],
    parallel=True,
    data_vars=["TEMP"],
    coords="minimal",
    compat="override",
    preprocess=preprocess_smyle,
)

# assign final coordinate values
ds0["Y"] = yrs
ds0["M"] = np.arange(ds0.sizes["M"]) + 1

Deepak Cherian (Apr 14 2021 at 18:03):
