Stream: ESDS
Topic: intake-esm and smyle datasets
Deepak Cherian (Apr 14 2021 at 18:03):
It seems like there is a common issue with efficiently reading subsets of CESM-LE / SMYLE / other datasets. Daniel's efficient solution to this problem is being copy-pasted in quite a number of notebooks I've seen.
I think the sustainable solution forward is to move that logic into intake_esm (if it is not there already) and promote intake_esm more.
cc @Stephen Yeager @Will Wieder @Daniel Kennedy @xdev
Daniel Kennedy (Apr 14 2021 at 18:11):
Here's where I have some code to access the CESM-LE: https://github.com/djk2120/cesm-lens/blob/main/notebooks/lens_template.ipynb
Will Wieder (Apr 14 2021 at 18:15):
This works for me.
Can we include some good documentation for how users should interface with the function in intake_esm?
Matt Long (Apr 14 2021 at 18:29):
@Max Grover put together an example on intake_esm last week:
https://ncar.github.io/esds/posts/intake_esm_dask/
The package is documented here:
https://intake-esm.readthedocs.io/en/latest/
To make this work for SMYLE/DPLE, we need to ensure that we have a working spec:
https://github.com/NCAR/esm-collection-spec
This needs an entry in the aggregation_control section, something like:
{
  "type": "join_new",
  "attribute_name": "start_date",
  "options": {"coords": "minimal", "compat": "override"}
}
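For context, here is a hedged sketch of what a fuller aggregation_control section of an esm-collection-spec catalog might look like around that join_new entry. The column names ("variable", "component", "experiment", "member_id") are illustrative assumptions and would need to match the actual catalog's columns; check the spec repo for the exact schema.

```python
import json

# Sketch (not a real SMYLE catalog): an aggregation_control section that joins
# forecast files along new "start_date" and "member_id" dimensions.
# All column/attribute names here are illustrative.
aggregation_control = {
    "variable_column_name": "variable",
    "groupby_attrs": ["component", "experiment"],
    "aggregations": [
        {
            "type": "join_new",
            "attribute_name": "start_date",
            "options": {"coords": "minimal", "compat": "override"},
        },
        {
            "type": "join_new",
            "attribute_name": "member_id",
            "options": {"coords": "minimal", "compat": "override"},
        },
    ],
}

print(json.dumps(aggregation_control, indent=2))
```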
We worked through aspects of this to support CMIP DP experiments, but I am not sure if any of the work was retained. @Anderson Banihirwe, do you remember?
Existing catalogs:
https://intake-esm.readthedocs.io/en/latest/supplemental-guide/cmip_ap.html#available-catalogs-at-ncar
Deepak Cherian (Apr 22 2021 at 22:55):
I posted this on the repo but it's private so I'm posting here too since it seems to be broadly useful.
The following version seems to work and is a single open_mfdataset
call. It should be more feasible to write it as an intake-esm thing.
import numpy as np
import xarray as xr


def preprocess_smyle(ds0):
    """This is applied on an individual file basis."""
    d0 = ds0.isel(z_t=0).isel(time=slice(0, 24))
    # quick fix to adjust time vector for monthly data
    nmonths = len(d0.time)
    yr0 = d0["time.year"][0].values
    d0["time"] = xr.cftime_range(str(yr0), periods=nmonths, freq="MS")
    # DC: Can't assign M yet since we have a single file = 1 member
    # d0 = d0.assign_coords(M=("M", np.arange(sizes["M"])))
    d0 = d0.assign_coords(L=("time", np.arange(d0.sizes["time"]) + 1))
    d0 = d0.swap_dims({"time": "L"})
    d0 = d0.reset_coords(["time"])
    # DC: explicitly add the dimension Y so xarray knows to concatenate time
    # variables also; if you know what Y should be, you could do
    # `.expand_dims(Y=[year])`
    # - d0.encoding["source"] is the file name, which might be useful for
    #   setting some coordinate values
    # You could also do this with TEMP to be really explicit about what you
    # want. Because you pass data_vars=["TEMP"], xarray will do expand_dims
    # on TEMP for you.
    d0["time"] = d0.time.expand_dims("Y")
    d0["time_bound"] = d0.time_bound.expand_dims("Y")
    # DC: subset as in the original code; not necessary but does speed things up
    return d0[["time", "time_bound", "TEMP", "TAREA", "UAREA"]]


file_list, yrs = nested_file_list_by_year(
    filetemplate, ens, field, firstyear, lastyear, startmonth
)

ds0 = xr.open_mfdataset(
    file_list,
    combine="nested",
    # concat_dim depends on how file_list is ordered;
    # the innermost list of datasets is combined along "M",
    # then the outer list is combined along "Y"
    concat_dim=["Y", "M"],
    parallel=True,
    data_vars=["TEMP"],
    coords="minimal",
    compat="override",
    preprocess=preprocess_smyle,
)

# assign final attributes
ds0["Y"] = yrs
ds0["M"] = np.arange(ds0.sizes["M"]) + 1
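The nested combine above only works if file_list is ordered to match concat_dim=["Y", "M"]: the inner lists run over ensemble members, the outer list over start years. Here is a hedged, pure-Python sketch of what a helper like nested_file_list_by_year might do; the function name (suffixed "_sketch") and the filename template are illustrative assumptions, not the actual SMYLE naming convention.

```python
# Sketch of building a nested file list whose ordering matches
# xr.open_mfdataset(combine="nested", concat_dim=["Y", "M"]):
# inner lists vary over members ("M"), the outer list over years ("Y").
# The template string below is purely illustrative.
def nested_file_list_by_year_sketch(template, members, firstyear, lastyear, startmonth):
    yrs = list(range(firstyear, lastyear + 1))
    file_list = [
        [template.format(year=yr, month=startmonth, member=m) for m in members]
        for yr in yrs
    ]
    return file_list, yrs


files, yrs = nested_file_list_by_year_sketch(
    "b.e21.SMYLE.{year:04d}-{month:02d}.{member:03d}.nc",
    members=range(1, 3),
    firstyear=1970,
    lastyear=1972,
    startmonth=2,
)
# files[i][j] is the file for start year yrs[i] and member j:
# the outer index maps to "Y", the inner index to "M".
```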
Last updated: Jan 30 2022 at 12:01 UTC