Stream: ESDS

Topic: intake-esm and smyle datasets


Deepak Cherian (Apr 14 2021 at 18:03):

It seems like there is a common issue with efficiently reading subsets of CESM-LE / SMYLE / other datasets. Daniel's efficient solution to this problem is being copy-pasted in quite a number of notebooks I've seen.

I think the sustainable solution forward is to move that logic into intake_esm (if it is not there already) and promote intake_esm more.

cc @Stephen Yeager @Will Wieder @Daniel Kennedy @xdev

Daniel Kennedy (Apr 14 2021 at 18:11):

Here's where I have some code to access the CESM-LE: https://github.com/djk2120/cesm-lens/blob/main/notebooks/lens_template.ipynb

Will Wieder (Apr 14 2021 at 18:15):

This works for me.
Can we include some good documentation for how users should interface with the function in intake_esm?

Matt Long (Apr 14 2021 at 18:29):

@Max Grover put together an example on intake_esm last week:
https://ncar.github.io/esds/posts/intake_esm_dask/

The package is documented here:
https://intake-esm.readthedocs.io/en/latest/

To make this work for SMYLE/DPLE, we need to ensure that we have a working spec:
https://github.com/NCAR/esm-collection-spec

This needs an entry in the aggregation_control section, something like:

      {
        "type": "join_new",
        "attribute_name": "start_date",
        "options": { "coords": "minimal", "compat": "override" }
      }
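
To show where such an entry might sit, here is a minimal Python sketch that builds an aggregation_control block and serializes it to JSON. The surrounding keys (variable_column_name, groupby_attrs, aggregations) follow the esm-collection-spec layout as I understand it. The column names variable, component, stream, time_range, and member_id are assumptions for illustration only and are not taken from an actual SMYLE/DPLE catalog.

```python
import json

# Entry suggested in the thread: stack datasets along a new "start_date" dimension.
start_date_aggregation = {
    "type": "join_new",
    "attribute_name": "start_date",
    "options": {"coords": "minimal", "compat": "override"},
}

# Hypothetical surrounding block; every column name here is an illustrative assumption.
aggregation_control = {
    "variable_column_name": "variable",
    "groupby_attrs": ["component", "stream"],
    "aggregations": [
        # Merge different variables into one dataset.
        {"type": "union", "attribute_name": "variable"},
        # Concatenate consecutive time chunks along the existing time dimension.
        {
            "type": "join_existing",
            "attribute_name": "time_range",
            "options": {"dim": "time"},
        },
        # Stack ensemble members along a new dimension.
        {
            "type": "join_new",
            "attribute_name": "member_id",
            "options": {"coords": "minimal", "compat": "override"},
        },
        # Stack initialization dates along another new dimension.
        start_date_aggregation,
    ],
}

spec_fragment = json.dumps({"aggregation_control": aggregation_control}, indent=2)
print(spec_fragment)
```

Whether the start_date entry should come before or after the member entry, and which options each needs, would have to be checked against a real catalog.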

We worked thru aspects of this to support CMIP DP experiments, but I am not sure if any of the work was retained. @Anderson Banihirwe, do you remember?

Existing catalogs:
https://intake-esm.readthedocs.io/en/latest/supplemental-guide/cmip_ap.html#available-catalogs-at-ncar

Deepak Cherian (Apr 22 2021 at 22:55):

I posted this on the repo but it's private so I'm posting here too since it seems to be broadly useful.

The following version seems to work and is a single open_mfdataset call. It should be more feasible to write it as an intake-esm thing.

import numpy as np
import xarray as xr


def preprocess_smyle(ds0):
    """Applied to each file individually, before the files are combined."""
    d0 = ds0.isel(z_t=0).isel(time=slice(0, 24))

    # quick fix to adjust time vector for monthly data
    nmonths = len(d0.time)
    yr0 = d0["time.year"][0].values
    d0["time"] = xr.cftime_range(str(yr0), periods=nmonths, freq="MS")

    # DC: Can't assign M yet since we have a single file = 1 member
    # d0 = d0.assign_coords(M=("M", np.arange(sizes["M"])))
    d0 = d0.assign_coords(L=("time", np.arange(d0.sizes["time"])+1))
    d0 = d0.swap_dims({"time": "L"})
    d0 = d0.reset_coords(["time"])

    # DC: explicitly add the dimension Y so xarray knows to concatenate time variables also
    # if you know what Y should be, you could do `.expand_dims(Y=[year])`
    #    - d0.encoding["source"] is the file name, which might be useful for setting some
    #      coordinate values
    # you could also do this with TEMP to be really explicit about what you want.
    # Because you pass data_vars=["TEMP"], xarray will do expand_dims on TEMP for you.
    d0["time"] = d0.time.expand_dims("Y")
    d0["time_bound"] = d0.time_bound.expand_dims("Y")

    # DC: subset as in the original code; not necessary,
    # but it does speed things up.
    return d0[["time", "time_bound", "TEMP", "TAREA", "UAREA"]]



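# nested_file_list_by_year is a helper from the original notebook (not shown here).
# Judging from how its outputs are used below, it returns a nested list of file
# paths ordered [start year][member] and the matching list of start years.
file_list, yrs = nested_file_list_by_year(filetemplate, ens, field, firstyear, lastyear, startmonth)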

ds0 = xr.open_mfdataset(
    file_list,
    combine="nested",
    # concat_dim depends on how file_list is ordered;
    # inner most list of datasets is combined along "M";
    # then the outer list is combined along "Y"
    concat_dim=["Y", "M"],
    parallel=True,
    data_vars=["TEMP"],
    coords="minimal",
    compat="override",
    preprocess=preprocess_smyle,
)

# assign final coordinate values
ds0["Y"] = yrs
ds0["M"] = np.arange(ds0.sizes["M"]) + 1
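
The open_mfdataset call above depends on file_list being a list of lists ordered [start year][member], because combine="nested" concatenates the innermost lists along "M" and then the outer list along "Y". The helper that builds this list is not shown in the thread, so here is a minimal sketch of a function that produces the same layout. The function name, its arguments, and the file-name template are all hypothetical and do not reflect the real SMYLE directory structure.

```python
def sketch_nested_file_list(template, members, field, first_year, last_year, start_month):
    """Return ([[path per member] per start year], [start years]).

    Hypothetical stand-in for the thread's helper: the outer list is indexed
    by start year ("Y"), the inner list by ensemble member ("M").
    """
    years = list(range(first_year, last_year + 1))
    file_list = [
        [
            template.format(year=year, month=start_month, member=member, field=field)
            for member in members
        ]
        for year in years
    ]
    return file_list, years


# Made-up template, for illustration only.
template = "smyle.{year:04d}-{month:02d}.{member:03d}.{field}.nc"

file_list, yrs = sketch_nested_file_list(
    template,
    members=[1, 2, 3],
    field="TEMP",
    first_year=1970,
    last_year=1971,
    start_month=11,
)

print(len(file_list), len(file_list[0]))  # 2 start years, 3 members each
print(file_list[0][0])  # smyle.1970-11.001.TEMP.nc
```

Keeping the outer list for "Y" and the inner list for "M" is what lets concat_dim=["Y", "M"] line up with the nesting. Swapping the nesting would require swapping the order of concat_dim as well.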

Last updated: Jan 30 2022 at 12:01 UTC